OpenAI Spinning Up in Deep RL Workshop | OpenAI



Transcript

Hello, and good morning everyone!

Hi! I’m Josh Achiam, I’m a Safety researcher

here at OpenAI and I’m the main author of Spinning Up in Deep RL, and thank you

all so much for being here today at OpenAI’s 1st Spinning Up Workshop.

For people who are tuning in on the livestream, I’d like to let you know that there is a minor technical difficulty, so we will not be able to broadcast the slides directly from my computer into the livestream video; you’ll be seeing the screen through the camera. In the event that that’s not enough for you to see it clearly, I just open sourced the repo that has the PDFs for these slides, so please go to github.com/openai/spinningup-workshop and you’ll find, in the RL intro folder, the RL intro PDF, which is the presentation that I’m about to give. Hopefully that makes it easier for you to follow along.

So, since this is kind of a new thing that we’re doing, I’d like to start today by talking about what it is, why we’re doing it, and what we hope you get out of being here. Education at OpenAI is this concept that, as part of our mission, we want to make sure that we provide for the public good and that we help foster a global community around AGI, which is the thing that at OpenAI we care the most about and are trying to figure out how to make sure happens in a way that’s safe and beneficial for all of humanity. So for those of you who

aren’t already familiar AGI is artificial general intelligence the idea

is that this is going to be some very powerful AI technology that’ll have the

ability to change pretty much everything about how we do anything something that

could potentially do most economically valuable work something that could solve

tasks that currently only human intelligence is capable of solving and

so we think it’s really important that we help people become aware of what AGI

is and what the technology that’ll likely underlie it is so that you can

think critically about issues that might come up in the future and also if you’re

interested, participate, because we really need people to step up and help make sure that this technology is safe and does what we want it to do and doesn’t cause anything harmful or detrimental to the world. So Spinning Up is the first thing that we’re launching under this

Education at OpenAI initiative, and the goal is to help people acquire technical

skills in the research topics that we care about so spinning up in deep RL is

a resource that hopefully all of you have seen by now it contains a number of

different pieces including a short intro to reinforcement learning so what is

this thing that we’re doing so much research about at OpenAI; an essay about

how you would go about becoming a researcher if you’re interested in

joining a curated list of important papers in the field so this is

particularly important because since this is an emerging field there isn’t

really a clear consensus on the best way to learn it or a textbook that

completely illuminates the way from start to finish and a lot of the

important knowledge right now is still in research papers so if you want to

find out the most stuff about this you have to go digging and hopefully this

helps you figure out where to look also a code repo of key algorithms because

for any of you who have tried hacking in this field before I’m sure you found

that there were a lot of very confusing resources out there really excellent

ones but nonetheless ones that made non-obvious choices and didn’t clearly

connect what they were doing to why they were doing it and so we hope that the

repo that we provide and spinning up in deep RL is part of something to bridge

the gap there and of course some exercises so if you want to actually try

coding something up there are a few ideas there for for what to do to get

you familiar with some of the key pieces of math or algorithms or what kind of

bugs you might expect and so why are we having workshops so in addition to

putting these resources online we think it’s gonna really help people if we work

with you one-on-one if we can see you face-to-face and talk with you and have

the kind of conversations and share the ideas that just you know don’t come up

in the sort of open loop control thing that happens when we put information on

the Internet today we’d like to have you come away

from this with a better sense of what the current capabilities and limitations

are in deep RL. I’ll tell you a little bit about what kind of research is out there

so if you want to go and follow some line of thinking you know what’s been

done and what hasn’t and we’d like you to actually try building and running

algorithms for deep reinforcement learning for possibly for the first time

and show you how to be confident in doing that so that if you want to keep

doing it afterwards you’re able to all right so then what is deep reinforcement

learning why do we need it why do we care about it

deep reinforcement learning is the combination of reinforcement learning

with deep learning RL reinforcement learning is about solving problems by

trial and error and deep learning is about using these very powerful function

approximators called deep neural networks to solve problems, and deep

reinforcement learning is just straightforwardly the combination where

we’re gonna have something that’s learning by trial and error and the

thing that’s getting learned is a deep neural network that’s going to make some

kind of decision or evaluate some situation and use that ultimately to in

some environment make decisions that lead to rewards where reward is just

some measure of how good or bad an outcome was so when would you want to

use RL? RL is useful when, for one, there’s a sequential decision-making problem; two,

you don’t know what the right thing to do in that situation is already if you

have the optimal behavior say from having watched human experts enough and

you have just a ton of data on exactly what to do in every situation then you

can use the standard tools of say supervised learning to exactly get some

machine learning system to duplicate that behavior but when you don’t have

access to that or when you suspect that what appears to be expert human behavior

is in fact suboptimal in that situation you may want to try

reinforcement learning instead because it could discover

things that wouldn’t have otherwise been known and you also have to be able to

evaluate whether or not a behavior or an outcome was good or bad this is pretty

critical so RL is good when it’s easier to

evaluate behaviors than to generate them or to exactly solve for them and when

would you use deep learning so the typical paradigm for deep

learning is that you want to approximate some very complicated function a

function that usually requires some amount of intelligence so for instance

if a human looks at a picture of a bird and then knows what species of bird that

is that’s a thing that you can’t really write down a simple mathematical rule to

do if you want to get a machine to do that you have to teach it from data and

other problems that you know you would want to do this for typically have

inputs or outputs that are very high dimensional because it’s just quite hard

to from an image or from a video stream or from an audio stream go to a decision

rule without doing some sort of learning in the middle and also you typically

want to have lots and lots of data because getting machine learning systems

to behave in any reasonable way requires that you give them sufficient examples

and there are tons of problems where this is exactly what you have and in

those domains deep learning has been very successful at exceeding whatever

was previously the state of the art from any other methods that existed before

and creating things that are now standard consumer products things that

were magic 10 years ago are like completely normal now the idea that we

have super excellent image recognition facial classification that you can talk

to your phone and it’s going to know what you said and it’s not just going to

come up with some completely random gobbledygook

this is getting better because we’re able to leverage this very powerful

technology that is deep learning for these problems and so deep RL is when

you have some very hard high dimensional problem where you can evaluate behaviors

and you want to get a machine to learn how to do it because you can’t write

down how it should in fact behave and some very simple examples of this are

say video games, where you want to go from a computer looking

at an image of the screen so just raw pixels to a decision rule that scores

the most possible points in the game or behaves in a way which is cool or

interesting or exciting or perhaps a really sophisticated strategy game like

Go, where really deep thinking and intuition and creativity is necessary to

make progress you can’t write down a simple rule for that but you can learn

it with reinforcement learning or perhaps you want to control some complex

humanoid, some robot, to run around and do stuff,

or maybe something which is a little less silly maybe a little more real

maybe you want to get robots in a factory to quickly learn a new task. (When the robot uprising happens, it’s because of this; we’re very sorry for this research.) This was trained, by the way, with an algorithm that was developed here at OpenAI called Proximal Policy Optimization. It’s one of the algorithms in Spinning Up, and if you haven’t had any experience with it, we won’t get into it in this lecture today, but at any other point in the afternoon, during the hackathon, I’m happy to go into detail.

So before we proceed into the RL-specific stuff: this is a crowd with a pretty wide range of backgrounds, and so

I just want to do a very brief recap of some of the patterns from deep learning

what do you expect when you set up a deep learning problem what does that

look like what do you have to think about so we typically talk about it in

terms of the language of finding a model that is able to give the right outputs

for certain inputs so in this case the model is going to be some function of

the inputs and parameters and the parameters are adjustable we control

them we change them and we want to change them in a way that’s going to

make the model behave according to some design specification the way that we

provide the design specification and get the parameters to satisfy it is by

setting up some kind of loss function this tells you in a nutshell how good

the model is at doing the thing that you want it to do usually some measure of

just how close the output from the model is to the

true output. And the critical thing about this loss function is that it has

to be differentiable with respect to the parameters in the model and when you

have that set up oh and of course there’s data as well so you have a bunch

of different examples of inputs and outputs and your loss function reflects

how well your model performs across all of them typically as just some average

over per-data-point losses. So with this setup you can then proceed to find the

optimal model through gradient descent the idea is that the gradient is a

mathematical object that tells you how much the loss changes in response to a

change in the parameters, and then, knowing that, you want to change the parameters in a way which is fruitful: that is, it reduces the loss, it reduces the measure of error. (The simplest version of this update is written below.)
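A one-line version of that update, plain gradient descent with a step size alpha (an assumed hyperparameter):

    \theta \leftarrow \theta - \alpha \, \nabla_\theta L(\theta)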

So what makes deep learning deep? What is the deep part? It’s

this idea that function composition is at the core of the models that we make

and that we consider so function composition just means that you have a

bunch of different parametrized functions and the outputs of one are the

inputs to the next one and you can arrange these in many different

topologies we’ll call these architectures for neural networks the

very simplest kind is just one where you have an input layer and then there is a

matrix that multiplies that and then you maybe add some bias to that vector and

then you pass that through a nonlinear activation function typically this is

going to squash the outputs from that first linear transformation into

something which maybe is in the range from 0 to 1 or 0 to infinity something

relatively simple but that non-linearity happens to do a lot of work and then

when you have successive layers what it allows the model to do ultimately is

represent successively more complex features internally so you might think

of the output of each layer as being a new representation of the original input

which has maybe rearranged the information in a way which is easier for

some kind of final decision making procedure at the end of the network to

make the right decision based on. (A tiny sketch of such a feed-forward pass is below.)
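Here is a small sketch of that kind of feed-forward pass in plain numpy, just to make the matrix-multiply, add-bias, squash pattern concrete; the layer sizes are arbitrary assumptions.

    import numpy as np

    def mlp_forward(x, params):
        # Each hidden layer: linear transform (W @ x + b) followed by a tanh squash;
        # the last layer is left linear for the final output.
        *hidden, last = params
        for W, b in hidden:
            x = np.tanh(W @ x + b)
        W, b = last
        return W @ x + b

    # Arbitrary example: 10-dimensional input, two hidden layers of 64, 3 outputs.
    rng = np.random.default_rng(0)
    sizes = [10, 64, 64, 3]
    params = [(0.1 * rng.standard_normal((m, n)), np.zeros(m))
              for n, m in zip(sizes[:-1], sizes[1:])]
    y = mlp_forward(rng.standard_normal(10), params)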

Aside from that very simple architecture, there are also substantially more complex ones. The other two diagrams on this slide are for LSTM networks (that’s in the lower left) and the Transformer network (that’s on the right). An LSTM network is a recurrent neural

network the idea is that this is the kind of network that can accept a time

series of inputs and produce a time series of outputs and internally it has

some very complicated mechanisms for making sure that information gets

propagated effectively across time steps in a hidden state so that when you make

a decision somewhere in the future you can remember something that you saw in

the past and then you can update the network in a way which is stable and

reasonable the transformer network is substantially more complicated and it

allows networks to do something called attending over their various inputs. Attention is a concept that we can all kind of relate to: when we look at the world, we don’t actually process literally every piece

of data that we take in concurrently we particularly attend to whatever happens

to be say in the center of our field of view or whatever we’re thinking about at

the moment whatever is most urgent and attention neural networks are able to

basically do that when they make some decision on the basis of a lot of data

they can select out the most important pieces of the data for making particular

kinds of decisions and that turns out to be very helpful in practice a few other

things about deep learning and this is mostly just I’m checking off some boxes

if you want depth on this I strongly recommend that you go see the spinning

up essay where there are a bunch of links to papers and other resources that

will give you detailed information about this but to check off the boxes we might

talk about regularizers. The idea is that sometimes optimizing your loss

function picking the model that actually gives the lowest value of your loss

function may not be the best thing to do you may wind up with a phenomenon called

overfitting where you’ve made your model behave perfectly with respect to the

data that you showed it but then it does a terrible job when it’s given any other

data because it learned a decision rule which

was entirely too specific but with regularization you trade off the loss

against something which has nothing to do with performance on the particular

task but just kind of says hey cool your jets a little bit

don’t be so avid about satisfying that objective and then it turns out that

regularization actually leads to models that do a better job of generalizing to

unseen data then there are also a couple of things that make the optimization

process smoother and easier so you might do some kind of normalization technique

where internally there’s some output in the middle of the network where it’s

good to adjustably rescale that and shift it around and that’s better than

just letting the network do whatever it would have done if you didn’t do this

kind of normalization it’s sort of spooky and there are some legitimate

complaints inside the community about whether or not we really understand why

this helps but it seems to so it’s worth knowing about also you might use a more

powerful optimizer than standard gradient descent this comes up also in

reinforcement learning actually many of the things that we’ve been talking about

in these past few slides show up in deep reinforcement learning which is why I’m

bringing them up adaptive optimizers do something special

in figuring out how to tune the learning rate the amount by which you change each

parameter at each step of updating in a way which leads to typically faster

convergence so you get to the to the optimum point a little bit sooner or a

little bit easier. There’s also the reparameterization trick, but that’s

quite complicated and so we won’t actually talk about it it’s on the slide

so that you know where to look all right that’s all this stuff from deep learning

that I wanted to talk about now onto reinforcement learning so first

and foremost we have to talk about how do you formulate a reinforcement

learning problem what does that mean what does that do what are the pieces of

it how do they fit together we typically use the language of saying that there’s

an agent that interacts with an environment so the agent is whatever

thing is making some kind of decision the environment is wherever those

decisions are happening and the thing that creates the consequences of those

decisions and there’s this loop where the

environment has some state and has some measure of how good it is to be in that

state that’s a reward and the agent gets to observe the state and possibly the

reward. It uses the reward for learning; whether or not it observes it is a subtle technical detail, but anyway: the agent gets a state observation and a reward, and then the agent makes some kind of decision about what action to take. It picks the action and it executes it in the environment, and then

the state of the environment changes there’s a new state of the environment

the agent perceives it the agent acts etc the goal of the agent is to figure

out what decisions will maximize the sum total of rewards that it’ll ever get

actually it’s slightly more specific than this and there are a couple of

different formulations that we can choose and we’ll talk about them

momentarily but that’s basically it in a nutshell we want to maximize this sum of

rewards that we get and the agent is going to figure out how to attain that

goal through trial and error so you just don’t know in advance what the right

thing to do is so you have to just try things see what happens see how much

reward you get and then adjust your decision on the basis of that so

reinforcement learning is about algorithms for doing precisely that but

before we can talk about the algorithms we have to introduce a bunch of

terminology for those of you who have done the work of going through the

spinning up material online this will probably be quite familiar and I’m

mostly going through it for the benefit of the audience that I expect might

watch this in the future as a starting point for this so bear with me I’ll try

to go through this reasonably quickly but we have to talk about observations

and actions policies trajectories rewards and returns what the RL

optimization problem actually is, how we formalize it, and then value and action

value functions and also advantage functions so there’s a whole lot of

stuff that you kind of have to know and unpack in order to really fruitfully

progress and reinforcement learning and and these are just those central pieces

so observations and actions a state is something which tells you absolutely

everything about the environment the agent usually doesn’t get access to

the state there is usually some stuff that’s just hidden from the agent so

what the agent perceives is called an observation if the observation contains

all the information in a state we called this environment fully observed if it

doesn’t we call it partially observed and states observations and actions can

be continuous or discrete for all of the problems that we care about in deep RL

the observations are continuous and the actions might be discrete or continuous

a policy is a rule for selecting actions there are a couple of different ways

that you can get to this kind of rule we typically classify them as one of two

kinds stochastic or deterministic a stochastic policy is a rule for randomly

selecting an action on the basis of the most recent observation or possibly

preceding observations as well a deterministic policy is just a map

directly from observation to action and no randomness involved at all you may be

wondering why it would be useful to have a random policy at all because it might

seem like randomness is just sort of dangerous but actually it can be quite

helpful and there are some very principled ways of optimizing stochastic

policies and it’s a little bit harder to optimize completely deterministic

policies there may also be a matter of robustness in that having a little bit

of randomness can make you more robust sometimes to perturbation then having

learned a brittle specific deterministic policy so now just to give some sort of

concrete examples in TensorFlow, because I assume that most of you will probably have met TensorFlow as your first deep learning library, and if not, PyTorch. And for those of you who are stuck with TensorFlow, I’m so sorry; you probably should have picked PyTorch, I know I should have, but here we are.

So in TensorFlow, for a stochastic policy over discrete actions, we might first set up a placeholder for loading in observations, and then we might set up a multi-layer perceptron network, an MLP network. This is just the most basic kind of feed-forward neural network, the thing that I talked about earlier, which is a succession of linear transforms of inputs followed by nonlinear transforms. In this case the linear transforms take you to something of size 64, there are two of them, and the activation is a tanh activation, so this gets you to a range of minus one to one in a nice smooth way. And then we produce logits based on the output from that piece of the network. Logits are basically something that precedes having probabilities for particular actions: you take the softmax of the logits. If softmax is not a function you’re familiar with, I recommend looking it up; it’s just something that exponentiates all the logits and then divides by the sum of those exponentiated logits, so it normalizes the distribution to being a probability distribution: all the entries have to be greater than zero and sum up to 1. So we get logits, and then we get actions by using tf.multinomial to sample something stochastically, assuming that the probabilities are based on taking the softmax of those logits. You can ignore the squeeze; that’s just there for making sure that certain things actually work.

And then for the deterministic policy, let’s say we have a continuous action case, so we want to output a vector of actions where each entry can be any real-valued number: we will just go from observation to network to a final layer which is just going to be the actions. (A rough sketch of both cases is below.)
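A minimal sketch of the two policies just described, assuming the TensorFlow 1.x API referred to in the talk; obs_dim, n_acts, and act_dim are example sizes standing in for whatever your environment uses.

    import tensorflow as tf  # assumes TensorFlow 1.x

    obs_dim, n_acts, act_dim = 8, 4, 2  # example sizes, not tied to any particular environment

    def mlp(x, hidden_sizes=(64, 64), activation=tf.tanh, output_activation=None):
        # Two hidden layers of size 64 with tanh, then a linear output layer.
        for size in hidden_sizes[:-1]:
            x = tf.layers.dense(x, units=size, activation=activation)
        return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

    # Stochastic policy over discrete actions: observations -> logits -> sampled action.
    obs_ph = tf.placeholder(dtype=tf.float32, shape=(None, obs_dim))
    logits = mlp(obs_ph, hidden_sizes=(64, 64, n_acts))
    actions = tf.squeeze(tf.multinomial(logits=logits, num_samples=1), axis=1)

    # Deterministic policy for continuous actions: the final layer's outputs are the actions.
    det_actions = mlp(obs_ph, hidden_sizes=(64, 64, act_dim))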

All right, so that’s policies. Let’s talk about trajectories. A trajectory is a complete sequence of states and actions through the history of an environment: the agent starts in a state, takes an action, then there’s a next state, next action, and so on. The first state in the environment is sampled from some distribution over starting states, and then afterwards, state

transitions are going to be either deterministic or stochastic but there’s

just some rule in the environment that given the current state and the current

action whatever action the agent took picks what the next state is a

trajectory is also sometimes called an episode or a rollout you’ll see this

terminology used completely interchangeably so just be aware that’s

out there. There’s, I’m so sorry, in every new-ish field a lot of terminology confusion, where different people in different areas of academia worked on it for a while and used different terms, and in the end we’re left with just a weird mishmash. Notation, too: you’re going to see some notation where states and actions are notated by s and a, and then in code you’ll see some places where it’s x and u, and this is because of the ancient

eternal conflict between the control theorists and the reinforcement learning

theorists and we’re just stuck with it now so that aside let’s talk about

rewards and returns. A reward function is going to map from the states and actions (or states and actions and possibly next states) onto just some number that tells you good or bad: positive is good, negative is bad, the more positive the better. And if you’re a designer setting up a reinforcement learning problem, you have to pick what that reward function is going to be, so you want to make sure that you incentivize the stuff that you want to have happen and disincentivize stuff that you don’t want to have happen. As a very simple example, suppose that you want a robot to run forward but you don’t want it to waste a ton of energy: maybe you will give it a reward proportional to its forward velocity, but you’ll penalize it in proportion to the action magnitude, so you’ll discourage superfluous actions. (A small sketch of a reward function like this is below.)
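A small sketch of a reward function in that spirit, for a hypothetical locomotion task: reward forward velocity, penalize action magnitude (the coefficient is an assumption you would tune).

    import numpy as np

    def reward(forward_velocity, action, ctrl_cost_weight=0.001):
        # Encourage running forward, discourage superfluous (large) actions.
        return forward_velocity - ctrl_cost_weight * np.sum(np.square(action))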

The return of a trajectory is some cumulative reward along it. We have two ways of formulating this, and what you’re going to find in deep reinforcement learning implementations is that we’re going to completely conflate which problem we’re trying to solve with the other. The finite-horizon undiscounted sum of rewards works when you have a finite horizon; it doesn’t work when you have an infinite horizon, because if you have an infinite sum of things it might diverge unless you do some kind of discounting. So in the other case, the infinite-horizon discounted sum of rewards, you have a discount factor gamma between 0 and 1, and that’s how you down-weight things that happen in the future; this makes sure that this is a reasonably well-defined quantity. But why would it make sense to discount things? You probably would rather someone tell you that they’re gonna give you $100 today than $100 in a hundred years, right? It’s just better to get it up front.

Then there’s the reward-to-go. This is closely related; it’s basically just a measure of return starting from a particular time step or state, so the reward-to-go from some point in time is just the sum of rewards that will happen after that point in time. (In symbols, see below.)

Now we can talk about the reinforcement learning problem formally. We’re going to set up a performance measure for a particular policy pi, J of pi, which is the expected value of return (for whichever formulation we’ve picked) according to a distribution over trajectories in the environment based on the choice of policy. What that means is that, again, start states come from a starting distribution, transitions in the environment are based on something in the environment, that transition distribution P, and actions will come from the policy conditioned on the observations of the states; and we want to find the optimal policy pi star, which maximizes this. (Written out, see below.)
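Written out, the performance measure and the optimization problem are:

    J(\pi) = \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \big], \qquad
    \pi^* = \arg\max_{\pi} J(\pi).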

Now we have to talk about value functions. Value functions are measures of how much reward you expect to get from a particular state or state-action pair, assuming that you’re going to behave a certain way. So we have the on-policy value function and action-value function, V pi and Q pi, which respectively tell you how good it is to be in a particular state and how good it is to be in a particular state-action pair, assuming that forever after being in those places you act according to the policy pi. Then there’s also V star and Q star: the same thing, except if you were to act according to the optimal policy. It’s great to know Q star, as we’ll talk about momentarily. Value and action-value functions are connected: the value is just the expected action value, expecting over what action you might take according to the current policy. And the advantage function tells you how much better a given action is than average, and it’s just the difference between Q and V. (Definitions below.)
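The definitions, in symbols (V star and Q star are defined the same way under the optimal policy):

    V^{\pi}(s)   = \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \mid s_0 = s \big], \qquad
    Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}\big[ R(\tau) \mid s_0 = s, a_0 = a \big],

    V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\big[ Q^{\pi}(s,a) \big], \qquad
    A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s).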

These value functions satisfy recursive Bellman equations. These are super important and they’re the foundation of a bunch of algorithms, so they’re really worth knowing and kind of just worth grappling with. I think these can be particularly tricky at first; I remember the first time that I met reinforcement learning, I was just so turned around and lost by these, and the notion that there was going to be this recursive equation where the definition of a thing depended on itself was quite confusing, but it’s worth just hitting your head on it for a while until it makes sense. What it’s saying is that the value of being in a particular place is going to be as good as whatever reward you get for being in that place, plus all the rewards that you’ll ever get for all the places you’ll go afterwards. (Written out, the Bellman equations are below.)
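Written out, the on-policy Bellman equations, plus the optimal one for Q, are:

    V^{\pi}(s)   = \mathbb{E}_{a \sim \pi,\, s' \sim P}\big[ r(s,a) + \gamma \, V^{\pi}(s') \big],

    Q^{\pi}(s,a) = \mathbb{E}_{s' \sim P}\big[ r(s,a) + \gamma \, \mathbb{E}_{a' \sim \pi}[\, Q^{\pi}(s',a') \,] \big],

    Q^{*}(s,a)   = \mathbb{E}_{s' \sim P}\big[ r(s,a) + \gamma \, \max_{a'} Q^{*}(s',a') \big].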

Now, why is it great to know Q star? Q star tells you: if you’re going to act according to the optimal policy forever after you started in this state and took this action (and we don’t care what policy this action came from), how well will you do? That means that if you want to do the best you possibly can, all you need to know is what action maximizes Q star in a particular state, and then take that action, because that’s going to be the best action in that state, and afterwards you’ve assumed that you’re going to do the best that you can ever possibly do. So if you have Q star, you basically have the optimal policy. This is going to lead us ultimately to the two different kinds of algorithms in reinforcement learning for control, where in one case we’ll try to directly optimize a policy, and in the other case we’ll try to find Q star.

Now, if we want to find Q star, we have to set up a function approximator for it, Q theta, which we’ll represent by some kind of deep neural network, and we’re going to want to measure how good it is at approximating Q star. This is what that recursive Bellman equation is going to be really helpful for, because the beautiful thing is, we don’t need to have acted according to the optimal policy to check how well Q theta fits that Bellman equation; we just need a bunch of examples of state, action, next-state, and reward tuples, and if we have enough of those over enough of the environment, then we can probably do a pretty good job of fitting Q theta based on that Bellman equation, based on maybe this mean squared Bellman error, and then use that afterwards for control, which is having a decision-making rule. (A sketch of the mean squared Bellman error is below.)
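A sketch of that mean squared Bellman error, over a batch D of (s, a, r, s') transitions; in practice the right-hand side usually uses a separate, slowly updated target network, a detail omitted here:

    L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[ \big( Q_\theta(s,a) - ( r + \gamma \max_{a'} Q_\theta(s',a') ) \big)^2 \Big].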

By the way, I apologize if anything has been confusing about my using the terminology of control interchangeably with the terminology of reinforcement learning: when I say control, I mean having the best policy. So now, what kinds of RL algorithms are out there? Behold, a taxonomy which is much more restrictive than it looks: it looks very pretty and it looks very definitive, but it’s actually masking a lot of subtlety and, you know, detailed choices, and the fact that there’s actually a lot more bleed-over between these things than you might expect. But at a very high level, this is a

useful picture to start with that we have two different kinds of RL

algorithms ones where we have access to the model of the environment and ones

where we don’t so what that means a model of the environment is something

which tells us if we’re in a given state and we take a particular action what’s

gonna happen next the model would predict what the state of the

environment will be after that and that’s really useful because if we can

forward simulate the environment then that’s extremely helpful for evaluating

our current policy it’s extremely helpful for figuring out what a better

action would be than the one that we might want to take so if you don’t have

a model you’re quite limited you just have to figure out how to do well based

on experiences that you’ve seen your direct interactions of the environment

you don’t get any other information but if you do have a model it’s quite

potentially powerful although as we’ll discuss the methods for model-based

reinforcement learning are not quite as mature so far as the methods for model

free reinforcement learning so now okay that last slide was just a ton of

acronyms maybe not that insightful let’s talk about what these algorithms are

doing there are three key pieces in any reinforcement learning algorithm for one

you’re going to run the policy in the environment you’re going to actually try

things and get to some signal error or otherwise and then you’re going to have

to reflect and evaluate whether or not those decisions were good ones whether

or not those actions were the right ones you have to figure out how good your

current policy is so that you can use that information to improve it so you

run the policy you evaluate the policy you improve the policy and there are a

bunch of different ways of doing that and we’ll go into some depth about how

different algorithms go about doing that

so let’s start with policy optimization

minor interlude in the chat last night I surveyed people to see what they were

interested in I asked if people were interested in

math there’s gonna be some math so first at a very high level zooming out ten

thousand foot view in policy optimization we’re going to run the

policy by collecting complete trajectories or snippets of trajectories

based on our current stochastic policy and we’re going to explicitly represent

that stochastic policy with a neural network that perhaps gives these

sufficient statistics of the action distribution or something else that we

can use to derive that and sample from it and then we’re going to evaluate the

policy by figuring out the on policy value function and advantage function

and we’re going to evaluate those things for all the states and actions in the

trajectories that we sampled and then we’re going to improve the policy by

making it more likely that we take the actions that led to higher advantage and

making it less likely that we take the actions that led to lower advantage less

likely that we take the bad actions how do we do that we’re going to have to

talk about some math now I realize there’s a chance that most of you maybe

weren’t expecting that we would be doing any kind of deep mathematical excursion

but if there’s one thing that I want you to take away from today aside from just

being excited about deep RL it’s a realization that there are some

limitations to what deep RL can currently do and that this is not really

a hundred percent done as a technology where you can just apply it to a problem

without really thinking about what it’s doing under the hood and get a good

solution it’s not a black box technology yet so if you want to try deep RL on a

problem and grapple with getting it to work you do have to kind of understand

what’s going on under the hood and that means taking a look at some of the gory

mathematical details understanding how they connect and forming an intuition

for how those details will shape the failure modes of your algorithm. So, what we’ll talk about: we’re just going to talk about

vanilla policy gradient we’re gonna talk about how you derive the policy gradient

and a bunch of different equivalent expressions for it and then we’ll get to

the pseudocode for the sort of standard version of vanilla policy gradient which

includes maybe a few more tricks and details than the very most basic vanilla

version apologies for the choice of words there but all of this stuff is

critical to understanding more advanced policy optimization algorithms like TRPO and PPO. We won’t be covering them in these slides, but again, I’m happy to talk about them offline during the hackathon. So, in policy gradient algorithms, what we

want to do is we want to find some kind of expression for the gradient of the

policy performance with respect to the parameters of the policy and we want to

just directly do gradient ascent on those parameters, so we’re going to move the

parameters in the direction that increases performance and is this gonna

be easy or hard well if we just try putting the gradient onto the policy

performance we run into a problem all the parameters are down here in the

distribution they’re not inside here where we would like them if we want to

get something that we can actually use we’ll have to do some messy work to

bring the gradient inside of an expectation which we could then form a

sample estimate of so step one to getting the gradient symbol somewhere

helpful we’re going to recognize that this expectation can be rewritten as an

integral going through all of the events in trajectory space every possible

trajectory of the density the probability mass or density for that

trajectory based on that policy times the return that you would get for being

on that trajectory and now we can bring the gradient in because the limits of

this integral don’t have anything to do with the parameters and then we apply

the log derivative trick so this is a really helpful mathematical trick comes

up all over the place in deep reinforcement learning. It’s basically just this notion that the derivative of the log of something is one over that something times the derivative of that something, and we

rearrange it slightly but it lets us go from the gradient with respect to theta

of P to P times gradient log P this is great because now we have an expectation

again we have an expectation based on trajectories sampled according to the

current policy, so if we have that data we can certainly make a sample estimate. (Schematically, the steps so far are written below.)
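Schematically, the steps so far, using the log-derivative trick \nabla_\theta P = P \, \nabla_\theta \log P:

    \nabla_\theta J(\pi_\theta)
      = \nabla_\theta \, \mathbb{E}_{\tau \sim \pi_\theta}\big[ R(\tau) \big]
      = \nabla_\theta \int P(\tau \mid \theta)\, R(\tau)\, d\tau
      = \int \nabla_\theta P(\tau \mid \theta)\, R(\tau)\, d\tau
      = \mathbb{E}_{\tau \sim \pi_\theta}\big[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \big].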

so the very nice thing here is that what we did after bringing the gradient

inside the integral and doing this log derivative trick is that we now have

something which is an expectation again because we’re integrating through all

possible trajectories of the probability density associated to that trajectory

times something which is a function of that trajectory so this is an

expectation and we can form a sample estimate of it that we can use in a

practical algorithm but we’re not completely finished yet because we still

have to talk about what’s the gradient of that log probability for a trajectory

how does that depend on the parameters of the policy so let’s go back to the

picture that we had in the beginning there’s a starting state which is drawn

from some distribution based on the environment, and then after that the agent picks an action based on pi theta, and it has probability pi theta of a given s, for time step 0. Then the environment picks the next state

according to whatever distribution it has over next States given your most

recent action and the most recent state. By the way, this is something that I glossed over earlier, slightly more formalism, details that you don’t quite need to

know but this is called the Markov property this notion that picking the

next state only depends on the most recent thing that happened and doesn’t

depend on the past before it that’s the the Markov property and you’ll find a

whole bunch of math if you go digging for it but you don’t have to for for

this at the very least so then what we have is that the probability of the

trajectory is going to be just the probability of that first state times the

probabilities of each transition and action selection that happens afterwards

so we get that expression up there at the top and now if we want to take its

gradient of its log we just pretty straightforwardly compute first

the log of that thing turns that product into a bunch of sums the gradient goes

through the sums and now all the terms that are based on distributions from the

environment have no dependence on the parameters of the policy the environment

doesn’t care what the policy is it’s just going to behave in whatever way it

does so those have no dependence on the parameters those derivatives are zero

and what we’re left with is just something which is a sum over time

steps of gradients of the policy and the beautiful thing is because we control

the policy and we have explicitly represented it as a neural network and

we can compute all of its gradients this is a thing that we can calculate so now

we’re at something where we can in fact calculate a sample estimate of this

gradient of policy performance and use that as the basis for a gradient ascent

algorithm for improving performance. (The expression we’ve arrived at is written out below.) But it’s not good enough; we’re not done yet.
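Written out: the trajectory probability, its grad-log-prob, and the resulting simplest policy gradient expression are:

    P(\tau \mid \theta) = \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t),

    \nabla_\theta \log P(\tau \mid \theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t),

    \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\; R(\tau) \Big].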

Yes, the function capital E: this capital E is an expectation, and if

we want to form an estimate for the expectation so we’re not going to

compute the expectation exactly what we’re going to do is we’re going to see

what happens for a bunch of different trajectories that are sampled according

to the distribution specified in that expectation and then we’re just going to

average them and in the limit as we have an infinite amount of data that sample

average becomes exactly equal to the expectation yes

absolutely absolutely you can so it is a bunch of derivatives of the final output

with respect to each one of the parameters right because there are many

inputs to this function and we’re going to have a derivative with respect to all

of them yes I’m sorry can you repeat the question yes can we tie this explicitly

to reward so inside the expectation here we have R of tau so that’s the return

measure that we’ve chosen whichever one we picked either the infinite horizon

discounted sum of rewards along the trajectory tau or just the finite

horizon undiscounted sum of rewards so that R of tau is the sum of all the

rewards in a particular trajectory, and that’s actually why the variance of this is going to be so unnecessarily high: there are going to be a bunch of terms in this sample expression, actually just in that expectation, which

have expectation zero on average they’re zero they don’t contribute anything but

we sample them anyway and the samples will have noise on them and so we’ll

just wind up getting the noise we won’t get much signal from them so can we

eliminate a whole bunch of terms yes we absolutely can the intuition here is

that if I give you a reward in the past and you want to update the action that

you just took really what you care about for figuring out whether or not the

action that you just took was good or bad are the consequences of that action

you don’t care about what preceded it. That action and what preceded it are almost completely uncorrelated; you’re not going to get anything by updating the likelihood of that action based on an old reward, so that, in

expectation is going to be zero and knowing that we can now expand out this

return measure and we’re going to get this in the finite horizon case just

for simplicity but this analysis also extends to the infinite horizon case so

we now have a sum of grad log probs of the policy times the sum of rewards

we’re gonna pull the sums out of this expression so that we can just look at a

policy update at a particular time step times a reward from a different time

step and then based on that thing that we asserted above we’re gonna drop all

the terms that are inconsequential all of those are zero and so what we’re left

with after we take away all the ones where T greater than T Prime we’re left

with this sum sum over the time steps for the policy times a sum over time

steps for rewards that goes for all of the time steps after the corresponding

policy time step and then if we bring that back in what we’re seeing now is

that we want to for each time step adjust the probability of the action

from that time step in proportion to the sum of rewards that came afterwards: only the consequences of an action will affect its update. (This reward-to-go form is written out below.)
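The reward-to-go form of the policy gradient:

    \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \sum_{t'=t}^{T} r_{t'} \Big].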

Yes; so it’s not that you don’t consider past actions: the sum over here in the beginning runs over all

time steps so every action is going to get to some update it’s just a matter of

which rewards are used in figuring out the update for that action and it should

only be the ones that were consequences of it yes yes

um well we do care about the future right because here we have a sum of

rewards after a particular time step all the rewards in the future from that time

step so so that expectation that’s just saying that an action that happens later

shouldn’t be affected by a reward that happened before it it should only only

be affected by the rewards that happen afterwards so in the in the next slide

actually we’ll see how this expression that we have down here at the bottom

connects to the value functions so what we currently have is what I’ll call the

reward to go policy gradient because what we’re doing is we’re adjusting the

probabilities of action proportionally to the reward to go what we’re going to

do now is go from that into an expression that has q pi the action

value on policy for a state action pair instead of that reward to go and this

works because you can break up the expectation so first we’re gonna pull

the sum over time steps out of this and then this expectation over trajectories

this is sort of subtle, and maybe a little more math than we can go into detail on here, but I recommend that you go look on the Spinning Up website, in RL

intro part 3 there’s a link separately to a proof about this but if we think

about the average thing that’s going to happen over all trajectories that’s

going to be equivalent to the average thing that happens over all of the cases

of something with the first T time steps of the trajectory we’re inside of the

expectation we’ve moved all the stuff that happens in the future

and we were able to move it inside past this one because this only depends on time step t; it doesn’t depend on stuff after t, so only this part is going to be affected by averaging over the future. And then it turns out that

that expression the average sum of rewards that you get starting from a

time step assuming that the state and action for that time step were fixed

that’s exactly equal to the action value that’s exactly saying how good is it to

be in a particular state take a particular action and then forever after

act according to a particular policy and now we have this expression for the

policy gradient at the bottom we’re most of the way through the math okay but

what is a baseline a baseline is a really important thing because it’s

another tool in our Arsenal for taking a policy gradient expression and turning

it into something which is lower variance more likely to be useful for

producing a good update to the policy, and it’s also the namesake for OpenAI Baselines (well, at least one of them; it’s a couple of things). But we have an expression here at the top which I claim

is basically true which is that the gradient policy gradient is the thing

that we had before but instead of Q we subtract out some function of state some

function b of s t, and I claim that in expectation it works out exactly the

same and so there’s a short proof here for that which is that if we look at the

expectation for that part of it what happens if you take the expected

gradient of the log probability of an action in a state times some function b

of st the b doesn’t have anything to do with the action so it’s a constant with

respect to this expectation so we pull it out and then what we’re left with is

an expectation over actions, which we’ll rewrite, and now we have it in probability times grad-log-prob. We’re going to reverse the log derivative

trick from earlier so this is now an integral over actions of the gradient of

the probability of that action and we can pull out the gradients

we’re just sort of reversing the procedure from earlier this thing this

integral over all possible actions of the probabilities of those actions

that’s just going to sum up to one; that’s just saying the probability distribution is normalized, all of the chances together have to come out to equaling 100% if you sum them up. And the derivative of a constant, since that’s a constant, is nothing; a constant has no rate of change,

so we get zero. So all of the terms of grad-log-prob times the baseline are zero in expectation, so we’re free to add this baseline without changing what the policy gradient is in expectation, but we can pick it in ways that are fruitful and make the estimate better. (The key fact is stated below.)
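The key fact, sometimes called the expected grad-log-prob lemma: for any function b that depends only on the state,

    \mathbb{E}_{a \sim \pi_\theta}\big[ \nabla_\theta \log \pi_\theta(a \mid s)\, b(s) \big]
      = b(s) \int \nabla_\theta \pi_\theta(a \mid s)\, da
      = b(s)\, \nabla_\theta \int \pi_\theta(a \mid s)\, da
      = b(s)\, \nabla_\theta 1
      = 0.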

So the typical thing to do is to pick the baseline to be the value function, and this leads us to kind of our final, sort of ultimate, form of the policy gradient: the form with advantage

functions and why is this good why is this good the advantage function says

how much better in action is than average why would you prefer that over

just how good the action is well let’s say you have two actions one gets you a

hundred dollars one gets you one hundred and one dollars you only sample the one

that gets you one hundred now when you’re trying to update your policy you

can feel really great about that oh man 100 is a big number I feel great

I’m gonna double down on that action. You’re acting sub-optimally: if you had been picking 50/50, on average you would have gotten a hundred dollars and fifty cents, and you would have realized that the advantage of taking the action that you picked is one hundred dollars minus a hundred dollars and fifty cents; you lost fifty cents, you should pick the other action. So you prefer to use

advantages to figure out which actions to increase the likelihood of as opposed

to just Q values all right summing it up we have these four different forms of

the policy gradient they’re all tightly connected we care about the last one but

to get to the last one we had to go through the pain but now that we’ve all

gone through that pain together you’re stronger you can go and you can

implement this and it’ll work and you’ll know why it works and you’ll feel good

about that, and if it breaks, you can fix it. (The four forms are summarized below.)
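Summing up, the four forms: the grad-log-prob of each action can be weighted by the whole-trajectory return, the reward-to-go, the on-policy action value, or the advantage,

    \nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \Phi_t \Big],
    \quad \Phi_t \in \Big\{ R(\tau), \;\; \sum_{t'=t}^{T} r_{t'}, \;\; Q^{\pi_\theta}(s_t, a_t), \;\; A^{\pi_\theta}(s_t, a_t) \Big\}.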

all right so then just to sum it up this key concept we want to push up the

probabilities of good actions push down the probabilities of bad ones and also

importantly that expectation requires trajectories sampled from the current

policy. So this is the concept of being on-policy in reinforcement learning:

that if you want to update your policy you have to use data from that policy

you can’t use data from some other policy unless you appropriately

reweight it, but reweighting data is complicated and really tricky, so it’s

sort of preferred to not do it unless you are trying to build something new

and cool and super sample efficient and you’re willing to spend a lot of time

and effort doing research on making sure that it actually works but ok so the

policy gradient expression gives us the policy improvement step. Coming back a bit... oh yeah, sure. The question was: how do we know what the

average reward would have been so that we could figure out how to make the

advantage function in the first place do we compute it as we go and and actually

that’s exactly what the next slide is about which is how do we do that

business of policy evaluation how do we find an estimate of the advantage

function which is actually good and reasonable if we just have a bunch of

data where do we get the value function that we might use to subtract out as a

baseline and the idea here is that we’re going to learn it from data and

typically it’s going to be by regression so this will be a subroutine that you’ll

find in most policy optimization algorithms where you’re going to have a

value function approximator another neural network and you’re going to at

each epoch of the policy optimization algorithm update the value network to

try to match the empirical returns that you saw so for a particular state the

value should be more or less the sum of discounted rewards that you saw from then on. (The regression objective, roughly, is written below.)
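Roughly, the regression objective: fit the value network parameters phi by mean squared error against the empirical returns collected under the current policy (in practice by a few gradient steps rather than an exact argmin),

    \phi \leftarrow \arg\min_{\phi} \; \mathbb{E}_{s_t \sim \pi}\Big[ \big( V_\phi(s_t) - \hat{R}_t \big)^2 \Big].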

And then, when you have the value function approximator, you can use that to estimate advantages; we’ll talk a bit about estimating advantages from value function approximators on the next slide. But first, you may have noticed that I pulled a fast one on you,

which is that we went from in all the preceding slides dealing with the finite

horizon undiscounted case and then here in our optimization problem for learning

the value function I’ve dropped in discount factors why is that the answer

is because everyone does it this is where there’s not a particularly good

reason in my opinion that this happens but pretty much every policy

optimization algorithm that I’m aware of every every single implementation uses

discounted value functions and advantage functions but then treats the policy

optimization part as undiscounted it creates some bias it seems to work

shrug it’s perfectly reasonable to do that so it sometimes seems to be helpful

to set the discount factor to something a little smaller than one so keeping it

completely undiscounted would be gamma equals 1 for whatever reason with some

optimization problems, some RL problems, it’s a little bit harder if you pick gamma equals 1 than gamma equals 0.95, and I can’t say that there’s a

particularly good reason for this I would speculate that like in the

beginning of training if you pick a very high discount factor those empirical

returns will be very noisy and if you choose a discount factor less than 1

what you’re going to do is you’re going to attenuate some of the noise you’ll

bias that sum of rewards so that whatever happens soonest matters most

and if you happen to see a few positive rewards in a row then you’ll latch on to

that whereas maybe because of noise if you had really paid attention to

everything out to infinity you’d have just gotten a bunch of positives and

negatives and positives and negatives, and they would have cancelled out. I think it’s OK to think about it like that, yeah. Yes: after a certain point the trajectory just ends, you get to time step T and then it’s over; that’s finite horizon. Infinite horizon,

you go out to infinity alright so then how do we calculate the advantage

function given data from trajectories and a value function approximator so a

thing that I want to introduce here is this notion of n step advantage

estimates so what you’re going to do is you’re going to have a thing over on the

left side that approximates Q pi and a thing over on the right side that

approximates V PI so this thing for Q pi remember that that’s supposed to be an

estimate for how well you’ll ever do if you start in a state take an action and

then act according to the policy forever after you can just use the empirical

return the reward to go from that state as a sample estimate of the expected

value which is the Q value but in an N step advantage estimate what we’re going

to do is we’re not going to go all the way out to the end of the trajectory in

that sample estimate for Q we’re going to go n steps in and then use the value

function approximator to assume what’s going to happen for the rest of it and

this corresponds to a decision about how much bias or variance we find acceptable

in this advantage estimator so if you pick n equals 0 then your advantage

estimator in that case would be just the reward plus gamma times the value

function approximator for the next time step minus the value function

approximator for the current time step and that’s gonna be very high bias

because whatever is wrong with your value function is now going to be wrong with your advantage estimate, but it’ll be really low variance, because the only

thing that’s going to have variance to it is the reward and the stochasticity

in the next state transition but if on the other hand you pick n equals

infinity so for the q approximator you just take the exact sum of rewards that

you got in the real trajectory and then at the end you subtract out the value

function at st you’re going to accept all of the variance that’s in the

environment but the nice thing is you don’t have

bias in forming your policy gradient estimator with this because in

expectation the Q part is going to be exactly Q in expectation and the B part

recall that that was a baseline that we added with a guarantee of no bias in the

policy gradient, so in expectation that part falls out and it’s fine. (The n-step estimator is written out below.)
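One way to write the n-step advantage estimate, indexed so that n = 0 matches the one-step case described above (V_phi is the learned value function approximator):

    \hat{A}^{(n)}_t = \Big( \sum_{l=0}^{n} \gamma^l\, r_{t+l} \Big) + \gamma^{n+1}\, V_\phi(s_{t+n+1}) - V_\phi(s_t),

so n = 0 gives r_t + gamma V_phi(s_{t+1}) - V_phi(s_t) (low variance, but the value function’s errors become bias), and n going to infinity gives the full empirical return minus V_phi(s_t) (no bias from the value function, but all of the environment’s variance).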

The bias-variance tradeoff is typically managed through what we call generalized advantage estimation. This is a way of interpolating between all of those different possible choices of n-step advantage estimate using a factor called lambda, which acts sort of like another discount factor as the interpolation variable. It's a hyperparameter that you choose in each implementation, and it's generally good to set it somewhere between 0.9 and 0.97. Usually it's set-and-forget: in my experience I can't think of many cases where I saw a substantial difference in algorithm performance from adjusting it beyond that narrow range. If you set it equal to one, you recover exactly the n = infinity case, and if you set it to zero, you get exactly the n = 0 case. So it's good to leave it in a range where it puts a little more weight on the real empirical returns than on the biased value estimator, but not all the way at the extreme.
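Here's a short sketch (mine, under the usual GAE formulation) of how that interpolation is typically computed, given numpy arrays of rewards and value predictions for one trajectory:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    rewards: r_0 ... r_{T-1}
    values:  V(s_0) ... V(s_T), including a bootstrap value for the final state
    lam interpolates between the extremes: lam=0 gives the one-step
    (high-bias, low-variance) estimate, lam=1 gives the full empirical return.
    """
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]   # one-step TD errors
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):                          # discounted sum of the deltas
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```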

Okay, at long last, I give you the pseudocode for the full vanilla policy gradient algorithm, which incorporates everything we've talked about so far. We collect a set of trajectories by running the current policy in the environment, then compute the rewards-to-go so we can use them as targets for the value function approximator. We compute the advantage estimates with any method of advantage estimation, but typically generalized advantage estimation, and use those to estimate the policy gradient. With that, we take a gradient ascent step on the policy parameters; we might use an adaptive optimizer like Adam to accelerate learning. Then we do the supervised learning problem of getting the value function approximator to match the empirical returns, which is how we learn the value function. And then we loop. That's vanilla policy gradient.
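As a rough sketch (my own, not the workshop repo's code), one update step of that loop might look like this in PyTorch, assuming you've already collected observations, actions, advantages, and rewards-to-go as tensors, and that `policy_net` and `value_net` are hypothetical modules where the policy returns a torch.distributions object:

```python
import torch

def vpg_update(policy_net, value_net, pi_optimizer, v_optimizer,
               obs, actions, advantages, rewards_to_go, value_iters=80):
    # Policy gradient step: maximize E[log pi(a|s) * advantage]
    dist = policy_net(obs)
    logp = dist.log_prob(actions)
    pi_loss = -(logp * advantages).mean()        # minus sign because optimizers minimize
    pi_optimizer.zero_grad()
    pi_loss.backward()
    pi_optimizer.step()

    # Value function regression: fit V(s) to the empirical rewards-to-go
    for _ in range(value_iters):
        v_loss = ((value_net(obs).squeeze(-1) - rewards_to_go) ** 2).mean()
        v_optimizer.zero_grad()
        v_loss.backward()
        v_optimizer.step()
```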

Yeah, absolutely. So yes, usually you will pick networks of the same size for the policy and the value function. In cases where the environment is partially observed, you may want to have a single core recurrent neural network that can remember past information, give that recurrent network separate output heads for the policy and the value function, and train it jointly. It gets a little complicated, because I can't say there's any good work in RL theory that I'm aware of that reasons about how simultaneously optimizing both objectives on the same model alters the performance of the final policy, but that's what you would do in that situation. So yes, typically they'll be about the same size, unless they're actually sharing parameters, in which case they're sort of the same model. Yes?
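A minimal sketch of that shared-trunk idea (my own, assuming a discrete action space) might look like this:

```python
import torch
import torch.nn as nn

class RecurrentActorCritic(nn.Module):
    """Shared LSTM trunk with separate policy and value heads."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, act_dim)   # logits over discrete actions
        self.value_head = nn.Linear(hidden, 1)          # state-value estimate

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_dim); the LSTM carries information across time,
        # which is what lets the agent cope with partial observability.
        features, state = self.lstm(obs_seq, state)
        return self.policy_head(features), self.value_head(features), state
```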

Does the choice of initial policy affect convergence? Wonderful question, and sadly, in a lot of cases, yes. This is part of what goes into my saying that deep reinforcement learning is not a technology that's ready to be used as a black box yet. When we do experiments in deep reinforcement learning, we typically run the exact same experiment with different choices of seed for the random number generators, and what we find is that the seed, which at the beginning of the algorithm only changes the initialization of the policy and value functions, happens to matter quite significantly. Some seeds learn, some seeds don't, some seeds learn much more slowly than others, and there's no particularly good reason for it. We are generally quite heartened when we find an algorithm that appears to be robust to initial conditions, where the spread of the learning curves around the average is quite narrow; we think that's great, and it doesn't happen as often as we would hope. All right, do we have any other questions about policy gradients?

So, the slide counter in the bottom right corner says 47 out of 63; I may have slightly miscalibrated how long parts one and two were relative to their initial time slots of 45 minutes and 1 hour respectively. This is by far the longer one, but since we've been at it for an hour, I think this is a good point to take a 15-minute break, and we'll pick back up to discuss Q-learning after coffee. Thank you so much.

We will be resuming with Josh Achiam's introduction to RL in two minutes.

Hello. Hi everyone, we're about to get started with the second part of the intro to RL. Just as a heads up, I prepared entirely too many slides for the hour and 45 minutes I was scheduled to speak, so please bear with that; this is the first time we're doing this, and I'm still getting calibrated on what we can get through in that amount of time. Everything that I don't cover by 11 a.m., when I hand the mic over to the next speaker, I'm more than happy to share with you later today during the hackathon. In particular, the material I expect we won't quite get to is an overview of what's been accomplished recently in deep reinforcement learning, where the challenges and limitations are, and what the research horizons look like for those limitations. But before any of that, let's continue our discussion from earlier and talk about the next major family of algorithms for deep RL for control, which is to say Q-learning.

There are a lot of algorithms that fall under this umbrella. Deep Q-learning was one of the first algorithms that really made deep reinforcement learning viable and popular. Speaking from personal experience, I had just started my graduate career in 2014 when I heard about the "Playing Atari with Deep Reinforcement Learning" paper. I was just becoming aware of topics in AI and AI research, and that completely and totally blew my mind. It was the most exciting thing I had ever seen: that a computer could figure out, just from looking at what was happening on a screen, how to behave, how to play a game, how to do something that I thought required some human spark of understanding and capability, and the computer had it. It was beautiful and amazing, and it made me want to study this and participate in taking this technology all the way from where it was at that point to what it could be in the future. Anyway, Q-learning.

So, back to this RL loop that we have: run policy, evaluate policy, improve policy. In Q-learning, you run the policy by taking a step in the environment either randomly, so there's some stochasticity in what you do, or by acting in a way that's called greedy with respect to your current Q-function approximator. Remember, what you're trying to learn is Q*, the optimal action-value function, and if you happened to have Q*, then whatever action maximizes it in a particular state is the best action to take. But when you don't in fact have Q_theta equal to Q*, the maximizing action probably isn't great, so exploring a little by acting randomly is going to help you. Once you've taken that step in the environment, you send an action to it and get back a reward and the next state, and you store that transition (state, action, reward, next state) in a replay buffer. You save it for later, because you're going to use it for learning how to evaluate the policy, which is to say updating Q_theta to try to fit the Bellman equation. Once you have that, the policy improvement step is just looking into Q_theta and asking what action maximizes it; policy improvement is basically implicit in Q-learning.

We're going to structure our discussion of Q-learning around the original deep Q-networks algorithm, but pretty much everything in this discussion is quite general for Q-learning methods, because they all share this common DNA: you take a step in the environment, you take some gradient descent steps on your Q-function to minimize a mean squared Bellman error, and you use the techniques we'll describe in a minute, experience replay and target networks, to stabilize the learning procedure. So, Q-learning updates by bootstrapping.

What is that? It's this notion of how we're actually going to fit Q to that Bellman equation. We talked about minimizing mean squared Bellman error, and it's a useful picture to start with, so I'm going to keep using that terminology, although in a few slides I'm going to tell you something completely different and ask you to ignore this and pretend you never heard it. But this is where all the papers start and where all the tutorials start, so it's good to be familiar with it. What you do to update Q is set up a loss function where you average over transitions (state, action, reward, next state) from your replay buffer D, and you regress Q_theta against targets y, where those y's are obtained from the Bellman backup, from the Bellman equation, as the reward plus the Q value at the next time step. This is based on the Bellman equation for the optimal action-value function, so it has that max over next actions, which is to say it assumes that if Q_theta were optimal, if it were Q*, then whichever action maximized it in the next state would be the best one to take and would give the right value there. Interestingly, you don't propagate gradients through y, even though y depends on the parameters of Q_theta, and the reasons for this are kind of mathy, so we'll get to them in a bit.
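To make the "don't propagate gradients through y" point concrete, here is a minimal PyTorch sketch (mine, with hypothetical `q_net` and `target_net` modules and batched tensors) of the regression target and loss:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, obs, actions, rewards, next_obs, done, gamma=0.99):
    # Q(s, a) for the actions actually taken
    q_values = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bellman backup target: r + gamma * max_a' Q_targ(s', a'), computed with no gradient
    with torch.no_grad():
        next_q = target_net(next_obs).max(dim=1).values
        targets = rewards + gamma * (1.0 - done) * next_q   # done zeroes out the bootstrap at terminal states

    return F.mse_loss(q_values, targets)
```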

Okay, getting this to work. There are two main techniques I mentioned: experience replay and target networks. The idea behind experience replay is that you want to use a really wide distribution of data for training your Q-function. You don't want to fit it really well to a very narrow region of transition space, because if you do, it won't be good anywhere else, and if it's not good anywhere else, you won't be able to bootstrap it to the correct values even in the places where you have been trying to fit it; you'll get nothing that's actually useful for control. Experience replay helps you broaden the data distribution, fit Q well everywhere, and get something that's good for control.
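A replay buffer is usually just a bounded collection of transitions sampled uniformly; a minimal sketch (mine):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (state, action, reward, next_state, done) transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions fall out automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling over a large history keeps the training distribution
        # broad instead of concentrated on the most recent experience.
        batch = random.sample(self.buffer, batch_size)
        return map(list, zip(*batch))
```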

Target networks: bootstrapping with a function approximator is super unstable. That thing we said on the previous slide, where the y's depend exactly on the current thetas: actually, throw that out. You can't do that; it won't work. If you try it, what typically happens is that the Q values explode, they go to something really large or really negative, and it happens fast; you won't be able to control it, and even with reasonably well-tuned learning rates you probably won't be able to stop it. So instead we're going to have a target network, Q_theta_targ, and we're going to make sure that network tracks Q_theta reasonably closely but with a lag, so that it updates more slowly. That way, if you make an update to Q_theta that pushes a Q value a little too high or a little too low, it doesn't immediately propagate into Q_theta_targ and therefore doesn't propagate into the bootstrap. This whole y thing we're going to call the bootstrap, and the target network tamps down on instability.

Granted, you might ask: if Q-learning is so horrifically unstable, why would we want to do it like this in the first place? Why wouldn't we just differentiate all the way through that bootstrap? The answer is that if you differentiate all the way through, it tends not to work that well, and the reason this bootstrapping approach works well, as long as you take some appropriate precautions, has something to do with the theory underlying Q-learning, which we'll talk about in a few slides. But not quite yet; you're spared for now.
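The target network update itself is simple; a sketch (mine) showing the DQN-style hard copy, plus the slower polyak-averaging variant that other Q-learning methods often use:

```python
import copy
import torch

def make_target(q_net):
    # Start the target network as an exact copy of the main Q-network, frozen.
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)
    return target_net

def update_target(q_net, target_net, polyak=None):
    """Hard copy (DQN-style) if polyak is None, otherwise a slow exponential average."""
    with torch.no_grad():
        for p, p_targ in zip(q_net.parameters(), target_net.parameters()):
            if polyak is None:
                p_targ.copy_(p)
            else:
                p_targ.mul_(polyak).add_((1 - polyak) * p)
```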

Also, another note: in deep Q-networks, the particular algorithm we're talking about right now, the action space matters a lot. In describing that bootstrap we had a maximization over actions of the Q-function. If you have a Q-function that accepts a continuous state and a continuous action as input and feeds them into a deep neural network, figuring out the action that maximizes the Q-function output is really hard; it would be a nontrivial optimization problem, an expensive subroutine. So if we want to be able to take that max over actions easily, DQN applies specifically to the discrete-action case, where we can use a network architecture that, instead of taking an action as an input at the bottom of the network, emits action values for each possible action at the output. A single observation goes in, and K action values come out, where K is the number of actions, one for each. Because there's just a finite number of them, it's very easy to figure out which action maximizes the Q value: we can compare all of them directly.
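That architecture is just a network whose output layer has one unit per action; a minimal sketch (mine):

```python
import torch.nn as nn

class DiscreteQNetwork(nn.Module):
    """Maps one observation to K action values, one per discrete action."""
    def __init__(self, obs_dim, num_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),   # K outputs; the max over them is a cheap argmax
        )

    def forward(self, obs):
        return self.net(obs)                  # shape: (batch, num_actions)
```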

So now we can talk about the pseudocode for deep Q-learning. This is relatively straightforward based on the stuff we just described. One thing that's a little more specific than what I mentioned is this business of epsilon-greedy exploration. I mentioned before that you're going to explore by sometimes taking a completely random action and sometimes taking the action that's greedy, that maximizes your current Q-function approximator. Epsilon-greedy is a strategy for doing that: with probability epsilon, where epsilon is something small, you pick a completely random action, uniform over the K choices, and with probability 1 minus epsilon, most of the time, you pick the action that's greedy with respect to your current Q-function. That's the run-policy step. Then you store that transition into the replay buffer and anneal epsilon, because over time you want to explore less and exploit more; you want to rely on the policy as it gets better. After that, you evaluate the policy by learning Q* from the data, by improving Q_theta to be a better reflection of Q*. That's exactly the gradient descent step we described: you sample some transitions from your replay buffer, your experience replay memory, and compute the bootstraps for those transitions. There's a special case for transitions that ended in a terminal state, where we don't give the state after that time step any value. Then we use those y values in the Q-value regression, update the parameters, and every once in a while, with some frequency, copy the parameters of the main Q-network onto the target network, so the target network lags the Q-network and ensures stability. And that's deep Q-learning in a nutshell.

This algorithm kicked off everything. I mean, a whole bunch of stuff preceded it; you can't really point to any one moment in the history of a field that had no precedent. Before this there was neural fitted Q, before that there was Q-learning with linear function approximation, and there were all kinds of algorithms for trying to get things to work with nonlinear function approximators like deep neural networks. But this was the one that got a lot of people really excited. So anyhow, caveat emptor, buyer beware: this can break. This will not work on every problem out of the box. You'll try it in some places and it just won't work; you'll fiddle with hyperparameters and it still won't work; you'll try some tricks to stabilize it, because there are pretty much infinity tricks to make deep Q-learning better at this point, and some of the time it still won't work.

This picture here is from a recent paper that I really love and strongly recommend you take a look at if you're interested in seeing some analysis of failure modes for deep RL algorithms. It's called "Deep Reinforcement Learning and the Deadly Triad." The deadly triad is a set of traits that reinforcement learning algorithms might have which are known to occasionally cause divergence, and which create substantial obstacles for theoreticians who would like to come up with algorithms with provable convergence guarantees. The deadly triad consists of function approximation, off-policy learning, and bootstrapping, which are exactly the three things deep Q-learning relies on: we have function approximation in the form of neural networks, we have off-policy learning in the form of experience replay, and we have bootstrapping in the form of using the target network with a one-step backup as the regression target for Q. And so deep Q-learning works a whole lot of the time, and then some of the time it just doesn't.

In this set of experiments, the researchers examined deep Q-learning and a few variants of it, ablating on whether to include a target network. So here "Q" does not have a target network: the regression target it uses is based exactly on Q_theta, not Q_theta_targ. They tried it with a target network, and they tried a couple of other tricks that relate to how you use the target network to either estimate the value in the bootstrap or select the action in the bootstrap, tricks that are known to potentially help. They looked at all of these different cases on many different Atari games as the experimental test bed, and they clipped the rewards in the environments into a certain range so that they knew exactly, mathematically, what the ceiling on the possible real Q value would be: they chose it to be a hundred. Then they looked, over all the experiments they ran, at how often the maximum absolute learned Q value in an experiment exceeded the threshold that they knew was the true maximum possible Q value, and the answer was: a lot. This shows that Q-learning without target networks is very unstable, in that a lot of the time you will get this divergence phenomenon, and even as you include tricks that make it progressively more stable, you'll still expect to see divergence every now and then.

So we're going to dive into a little bit of math now, to get maybe some intuition for why this is the case, what deep Q-learning algorithms are really trying to do, and how that does or doesn't translate into the algorithm. We'll start by taking the operator view of the Bellman equation. The optimal Bellman operator T* is a map from Q-functions onto other Q-functions, and the value of T*Q at a particular state-action pair is given by the right-hand side of the Bellman equation we saw before. The optimal Q-function is the fixed point of T*, so Q* = T*Q*. That's great, and T* has this special thing about it, which is that it's a contraction map on the space of Q-functions. Contraction maps have some very special properties that we're going to talk about now. Yay.
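For reference, a minimal write-up of the operator being described, in my notation (consistent with the standard definition):

```latex
% Optimal Bellman operator acting on a Q-function, and its fixed point
(\mathcal{T}^{*} Q)(s, a)
  = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}
    \Big[ r(s, a) + \gamma \max_{a'} Q(s', a') \Big],
\qquad
Q^{*} = \mathcal{T}^{*} Q^{*}.
```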

The main thing about a contraction map is the idea that if you have two points and you apply the map to both of them, they'll be closer together, with respect to some distance function, than they were before. Expressed mathematically: the norm of f(x) minus f(y) is less than or equal to some constant factor beta times the norm of x minus y, the distance between x and y, and when that beta is less than one, we have a contraction; distances are shrinking. Why do we care about contractions? Because they have unique fixed points, and you can get to them by just repeatedly applying the operator to any initial point. This is something called the Banach fixed-point theorem, if you're interested in going on Wikipedia and finding something more precise than however I've typed this up. In a nutshell, forgetting about uniqueness for a moment, to see that repeatedly applying the operator gets you to a fixed point: look at a sequence of points where we have a contraction map f with modulus beta and each point in the sequence is generated by applying f to the previous one. If you look at the distance between successive iterates, you see that it shrinks as a function of the iteration number, so in the limit as the iteration number goes to infinity, that distance shrinks to zero. It converges; repeatedly applying the map gets you to the fixed point.

will get you to the fixed point t star is a contraction on Q functions so if

you could represent the entirety of the Q function that is to say the Q values

for every state action pair in the entirety of the environment which for

all the environments that we care about in deep reinforcement learning you

cannot easily do you can only do this with function approximation which is to

say you’re going to generalize whatever you choose for the value in one state

action pair will have some influence on another you can’t completely separate

them when you do function approximation but putting that aside so we could

represent all the action values for every state action pair and we applied T

star the operator to that function we would get a new function Q which is

closer to optimal than the one that went in and if we applied it over and over

and over again we would eventually get to Q star the fixed point of T star this

is value iteration it’s a classic algorithm and

reinforcement learning so before function approximation before deep when

you had environments where there were a discrete number of states and a discrete

number of actions and you could represent the Q values in a table of

elements one for each state action pair you could compute this exactly and use

this as a way to get to Q star now when you live in the problems that we do when
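In the tabular case this is only a few lines; a sketch (mine) of value iteration on Q, assuming you know the transition probabilities and expected rewards:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Tabular value iteration on Q.

    P: transition probabilities, shape (S, A, S)
    R: expected rewards, shape (S, A)
    Repeatedly applies the optimal Bellman operator until it reaches the fixed point.
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    while True:
        # (T* Q)(s, a) = R(s, a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s', a')
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:   # the contraction property guarantees convergence
            return Q_new
        Q = Q_new
```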

you’re trying to solve high dimensional complex video games high dimensional

complex strategy games you can’t use the table yet use a function approximator

and now your problem is that you can’t compute all of T star Qi and even if you

could you probably couldn’t find a choice of parameters that would allow

you to exactly represent it so if you want to do this kind of value iteration

you have to do it approximately and this is roughly what Q learning algorithms

with function approximation try to do which is that they push the parameters

of the network in the direction such that you move Q theta towards T

star q theta and sometimes this works and sometimes it doesn’t because when

you go to this function approximation setting this operation is not

necessarily going to be a contraction on the space of Q functions you might have

lost that property if you did expect divergence in fact I expect things to

blow up horribly if you preserved it or if you’ve done enough tricks to

stabilize it things will work pretty well in my experience Q learning

algorithms and their variants tend to be extremely sample efficient when they

work which is quite desirable and it’s very nice if they can recycle off policy

data because on policy methods sadly have to throw away tons of it but last

But one last point on Q-learning. What you normally see in deep learning algorithms, and in deep RL algorithms, is that paradigm of: there's an objective function, you optimize it, and you find the model that optimizes the objective. In Q-learning, don't be misled into believing, however many times you see it, that the mean squared Bellman error is really the thing you're optimizing. You change that function every time you change the target. The thing you're really doing is this sort of approximate value iteration: you're trying to apply an approximate operator that gets you to something better; you're not trying to minimize a fixed loss. That's not to say there aren't variants of these kinds of algorithms that do involve well-defined loss functions. There's a whole family of algorithms called gradient temporal difference methods which, if you're theoretically inclined and willing to go down a deep, deep rabbit hole, I recommend you check out; talk to me if you want references. Also, in the Spinning Up key papers doc, I believe there's a book in the bonus section for classic RL papers and review papers: Csaba Szepesvári's book on RL algorithms from 2010, which recaps a lot of this really great old material, including gradient temporal difference algorithms. I recommend you check that out if you're interested.

Yes? I'm actually working on some research on that right now; talk to me offline.

Yes, yes: this thing is called a temporal difference error because it is the difference between the Q value based on the next time step and the one based on the current time step. Yeah.

Yes, absolutely: what is the difference between off-policy and on-policy? On-policy algorithms have updates that are based on expected values of things where the distribution in that expectation depends on the current policy. So if you want to form a sample estimate of the thing in the update equation, you first have to run the current policy, collect interactions with the environment under the current policy, and use those samples to form the estimate. That's on-policy, because all the data you use has to be generated by the policy you're using at the time. In off-policy methods like Q-learning, when you make an update you use experience that might have been generated by older policies, not the current one. The current policy you can think of as being implicitly expressed in the current value of the Q-function approximator, but many steps ago it was different, and whatever data you got from interacting with the environment back then went into your replay buffer. Many steps later you still sample those states and actions from that replay buffer to help form your new update to the current Q-function. When the data was generated by a different policy, that's off-policy. Yes?

In what sort of situation would we use deep Q-learning; what's a use case for it? There's a great use case, actually: Facebook recently released a paper on their applied RL platform called Horizon, which they used to train, with deep Q-learning, neural networks that make decisions about when to send you push notifications. So DQN is actually in your phones right now.

Okay, then let's proceed to the next part, which is model-based methods. I'm going to be pretty brief about this: there's a very wide variety of different model-based algorithms, and we're not going to drill down into them the way we drilled down into policy optimization and Q-learning, but we'll give a relatively brief overview of some of the more salient points and a few algorithms I think are particularly interesting.

Back to the loop: run policy, evaluate policy, improve policy. Where do models fit in? Recall that a model of the environment lets you predict what's going to happen next, and you can use that for pretty much any of these. While you're running your policy, before you take an action, you can stop and imagine what's going to happen if you try many different things; you can create partial rollouts that you use to evaluate your different choices, and then you might pick something different than you would have otherwise. That's where a model can appear in running the policy. In evaluating the policy, you can use that same kind of look-ahead, simulating data to get a maybe more stable backup for your Q-function, or use some kind of Monte Carlo tree search style algorithm where you propagate Q values back up the tree to figure out something like an average-case Q value. And for improving the policy, you can regress your policy network, if you have explicitly represented one, towards whatever the outputs were from that look-ahead planning process. So if you have a model, it's very powerful; you can use it in a lot of different ways and embed it pretty deeply into RL. The problem is that models are very hard to learn, and you usually don't have them.

you usually don’t have them so let’s say you have just made a wonderful brand new

complex physical robot unless you have a lot of hours to spare and control theory

expertise you probably do not know how to fully characterize that and have a

simulator model which is going to be accurate in any reasonable way certainly

not accurate enough for training it in simulation and then directly applying

that simulation trained policy into the real world you may want to try learning

a policy from data but this can be quite tricky although there are some really

exceptional success cases but because yes uh yes you could make that argument

so I let’s say hardness to learn is not a fuck oh I suppose sorry the question

was can you make the same argument for value functions and I would say that

hardness to learn in this case should be interpreted more as has the research

community figured out really robust reliable standard methods for doing it

yet but not necessarily whether there’s some intrinsic quality of hardness

finding the correct model is a supervised learning problem if you have

enough data part of the problem in RL is that you usually don’t have enough data

and you would have to get it by interacting with the environment and

there may be areas in the environment very critical to decision making which

you’ve just never observed yet so imagine that you are in a giant maze and

you can try to learn a model of the maze as you go but until you’ve seen the exit

your model does not going to be very helpful for you and navigating except to

help you perhaps avoid repeating places that you’ve been to already but but yeah

in practice models tend to be so far hard to learn so let’s look at maybe one
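The "supervised learning problem" framing can be made concrete with a small sketch (mine): a dynamics model is just a network regressed from (state, action) pairs onto observed next states.

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from (state, action); trained as plain regression."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))

def model_loss(model, obs, act, next_obs):
    # Supervised learning: mean squared error between predicted and observed next states.
    return ((model(obs, act) - next_obs) ** 2).mean()
```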

So let's look at one case study in ways you can use models: planning and/or expert iteration. The basic idea is that you use your model from the current state to look ahead into the future and help guide your decision about what action to take. In planning, you might explicitly base your decision about what action to take on whatever the output from that look-ahead process is, together with your current value function. In expert iteration, you not only do that, but you also have an explicit representation of a policy, which you try to improve by regressing it towards the output of the look-ahead process.

As a case study, consider AlphaZero. AlphaZero is an algorithm that has achieved superhuman performance in a variety of complex two-player, fully observed strategy games, particularly chess, Go, and shogi. It was a successor to AlphaGo, the algorithm that beat human grandmasters at Go, and at the algorithm level AlphaZero is sort of beautifully simple. You have a neural network that emits two things: a probability distribution over moves to play, p, and a value, v, that says basically whether you're going to win or lose. You learn this with a very simple regression approach: you move the value function to be more like the true outcomes of the games played, and you update the policy by using a model-based look-ahead operator to figure out what a better policy would have been, based on your current policy and value function, and you move your current policy towards that. Then there's also some regularization. Very straightforward, and the look-ahead is done with Monte Carlo tree search, which stochastically considers different possible outcomes and aggregates data after doing partial rollouts down the game tree to figure out what would have been the best thing to do.

So this is one model-based approach. Now, this required having a perfect model of the environment, and in games like chess or Go that's feasible, because you can fully express the rules in a way that is easy to compute and forward-simulate, you don't have to learn anything from data, and nothing is partially observed, so your model doesn't have to do anything fancy to keep track of what's going on in the background. Very straightforward, and this kind of approach can be very powerful.
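For reference, the kind of loss being described above, written in the standard AlphaZero-style form (my notation; z is the game outcome and pi is the search-improved move distribution):

```latex
% Value regression toward the game outcome, policy regression toward the
% MCTS-derived move distribution, plus weight regularization
\ell(\theta) = \big(z - v_\theta(s)\big)^2
  \;-\; \pi^{\top} \log p_\theta(s)
  \;+\; c \,\|\theta\|^{2}.
```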

But the problem is that most conditions are not quite as ideal as this. Another family of approaches uses the model for policy evaluation. Say you have learned a model, or perhaps you're given one, but more often than not for these algorithms you're trying to learn it concurrently with experience. You learn some models, and then you have the agent, quote, "dream" in them: the agent samples a bunch of fictitious trajectories inside the learned simulator and uses those as the basis for a policy improvement step. Algorithms like this include Model-Ensemble TRPO and, I want to say, Model-Based Meta-Policy Optimization. Instead of using this for computing advantages in a policy-optimization-style improvement, you could also use it for Q-learning: perhaps instead of forming the target based on the bootstrap, which might be inaccurate in regions of state-action space you haven't visited, you use the model to simulate what the bootstrap might be in those cases and use that as your backup for Q-learning. That's an approach called model-based value expansion.

The gain you get from these algorithms is ultimately sample efficiency. What happens in normal deep RL is that you use tons and tons of data from interacting with the environment to try to improve your policy or your Q-function, and you make progress at whatever pace. When you use the model and offload a whole lot of the improvement steps onto experience collected in the model, that frees you from having to collect that amount of experience in the real world, as long as your model is good enough. If your model is not good, this won't be very helpful, but if it is good, and if you only needed a little bit of data to train it, then you can get a lot of mileage out of it, and your overall RL algorithm will have used fewer interactions with the real environment than it would have otherwise. This is great for cases where interacting with the real environment is very expensive: for instance, if you want to train something on a physical robot, that can be an expensive process. The robot might be slow, the robot might break, the robot might be difficult to reset, and you probably don't want to spend that many man-hours waiting around for the robot to finish its learning procedure. So if you can offload some of that time into simulation, it makes life better.

Yes: is that what you would apply for self-driving cars? That's a good question. I'm not actually all that familiar with cases where self-driving cars have fruitfully made use of deep RL; that's not to say they don't, I just don't know. I would imagine that for self-driving cars it's probably more a matter of collecting data from experienced human experts and then using that data as the basis for learning a behavioral policy, but I'm also happy to go through this later and see what we can find in the literature.

Yes: would model-based RL be more geared towards transfer learning? I think it could potentially be quite helpful. Certainly, when we think about trying to get robots to transfer from, say, simulation to reality, we want to make sure the model used in simulation is high-fidelity with respect to reality, and if that's the case, you can think about sim-to-real as sort of a model-based approach, and perhaps it's going to be very helpful.

all right and then there’s this other completely orthogonal way of using

models which I’m really fond of because it’s just sort of weird which is that

you actually take the model and embed it inside of a model free agent where the

model is going to receive inputs from the from the environment and use that

with some internal process of perhaps imagining some futures and then

transforming whatever representation and has of those futures into something

which then becomes side information to the model free agent so you train the

model separately from the agent the module that provides some information

based on the model to the agent is sort of decoupled from it except that however

it’s going to process however the model free agent will process that information

is based purely on the model free learning so this is an approach called

imagination Augmented agents I think this is really interesting and really

neat I’m not aware of a whole lot of follow-up work from when this came out I

want to say last year or the year before but I just think that because it is so

different from the other model-based approaches that’s cool whenever there’s

something different it’s cool all right that takes me to what was originally

intended to be the end of part one but it’s now the end of both parts thank you

so much

At this point I would like to turn over the mic and the stage to Matthias Plappert, who is a researcher on the robotics team at OpenAI, and he'll be presenting the robotics team's work on learning how to do complex manipulation with deep reinforcement learning on a real physical robot.

Great, thank you. Do we have the screen? Yay, I think it works. Okay, thank you. Cool. So hey everybody, my name is Matthias, as Josh mentioned, and I'm super excited to be here and talk a little bit about what robotics at OpenAI is doing. The talk I'm going to present is called Learning Dexterity. As I mentioned, this is basically the effort of the entire robotics team over many months, so everything I'm talking about is not just my work but the whole robotics team's. Okay, cool.

so let’s maybe start with talking a little bit about what robotics at open

era is actually trying to do and the ultimate goal I guess robotics at open

eye has is suppose some form of general purpose robot so I think this kind of

picture illustrates as well very well we have human-like robots today and we know

that humans can do a very very large amount of different jobs and skills so

that can include things like cooking it can include things like actual labor in

some form of agricultural thing maybe it’s very precise kind of things

like surgery or building things and putting things together in this kind of

stuff and ideally we would like to have a robot that has a similar similar level

of dexterity and a similar level of well general purpose Ness if you will the way

robotics looks right now it’s very different from that so we have these

kind of very specialized robots so an example I think that is good it’s the

Roomba which is on the lower in the upper left corner here that can clean

your house but it can only clean your house it can only vacuum your house and

similarly your things like self-driving cars which to some extent also robots

that are very good at one thing which is driving themselves but they cannot do

anything else and the robots there are more kind of versatile and more

complicated they are either very often controlled by humans so an example for

that would be doing surgery so we have robots that can assist humans in that

but they’re always controlled by human operator which is a surgeon or we have

more complicated robots in factories but those are typically just programmed to

basically blindly execute a given show secretary so someone sits with the robot

and figures out how to do a certain process in a factory and the robot is

very very stupid and has no idea what’s going on so the question of course is

how can we kind of step away from that paradigm and how can we have robots that

work in an actual physical world and aware of their surroundings and given

that this is the spinning out workshop that’s concerned with ll it’s not so

surprising that we think RL may be a good approach to that and we know that

And we know that RL works really well in certain domains, so I've picked out two examples that probably most people have seen. On the left we have AlphaGo playing against Lee Sedol in a game of Go, and as you know, AlphaGo won this game; in fact, I think it won almost all the games it ever played, and the follow-up versions of AlphaGo are beyond human capabilities when it comes to playing Go. Similarly, we have Dota 2. This is some of the work the Dota team at OpenAI has been doing for a while: we have this Dota bot called OpenAI Five that is very, very good at playing Dota 2, which is a 5v5 multiplayer game, and it is approaching professional level. It consistently wins against semi-pros, and we have already played against some pros; in fact, we did that last summer at The International. Unfortunately, we have not yet won against those pros.

So the question is, how does this work in robotics? Of course, like a lot of work in robotics, it's not like we're the only ones doing this, and I'd like to give a bunch of examples that I think illustrate what people are typically doing today. The first approach here is somewhat recent, from 2017, and I think it looks really cool: you can see the agent is even able to use certain tools, in this case a hammer, it can open doors, it can do all sorts of things. The unfortunate thing is that all of this looks really cool, but it's only in simulation, and ultimately in robotics it doesn't really count if it's only in simulation, because you want the physical robot to do something; otherwise it's not very useful.

The other approach people have been taking is to train on the actual robot itself. This is some work from 2016 where people did dexterous in-hand manipulation: the goal of the robot here is to manipulate this tube, filled with coffee beans for some reason, into a target orientation, and they do all the learning on the actual robot. That has the advantage of not having to do any form of transfer: because you learn on the robot, you know exactly how the robot is going to behave, and once you have a good policy you're done. The downside, of course, is that you have to run on the actual robot, so it breaks a lot on you, it's very slow to do, and you can't really scale it up unless you get a lot of robots, which is actually something people are doing. That's the approach taken by Google, and in typical Google fashion they scaled it up: just get a lot of robots and let them run in parallel for two months, and then you can train on the robot, because you have 20 of those doing it in parallel, and it can do very meaningful stuff. In this case they learned to grasp arbitrary objects out of this kind of box, and it's actually a very impressive demo; this kind of bin picking is actually very hard. The thing is still that this does not really scale all that well, because this is a relatively simple task, yet you need 20 robots going for two months, and you also have to babysit the robots all the time: you have to repair them when they break, you have to reset the environment when objects fall out of the bin, and all this kind of stuff. It's just a lot of work.

So what we're trying to do is combine the benefits of those two approaches: training in simulation and then transferring to the physical world, which is called sim-to-real. I'll be talking a lot more about this, but before I do, I'd like to introduce you to the task we actually have in mind when we do our research. We decided to do dexterous in-hand manipulation, and the reason for that is, first of all, that it is very hard to do, and second of all, it is something we're interested in because we know that our hands are these universal end effectors. Human hands are very versatile in what they can do: they can be very dexterous, you can cook with them, you can operate on a human, if you're a surgeon at least, but you can also do very heavy lifting with them and use tools made for human hands, and so on. So this is basically the motivation for why we chose this kind of hand and this kind of task: because it's hard, and because it's ultimately useful for the general-purpose robot we would like to build.

the reason why it’s hard I think is summarized relatively well in this this

kind of slide so we use a hand called the shadow Dexter’s hand which is

depicted in this picture it has 24 joints and it has 20 actuators so what

this means is that your policy it and every time set has to produce an action

for 20 individual actuaries and it actually has to coordinate right like

you’ll have to have different joints work together to do certain things so

it’s a really high dimensional kind of control problem that’s typically well

out of reach of what traditional control problems can solve as I mentioned

ultimately we wanna run this on real hardware and so we have to work with the

real hard way and all its flaws and issues so this includes things like

noisy and delayed sensing so that’s just a fact of physical hardware systems

right like they will not have perfect information and they will have delays

and certain certain quirks that you kind of have to deal with the other issue

that comes out of this sensing is that you actually have to handle partial

observability so in simulation you have perfect knowledge of everything that’s

going on because well it’s your simulation and you can just read out

from your simulation what the current state is but on the physical system you

can only use what you can actually sense so obviously certain things like the

friction for instance of the system cannot directly be observed and then

last of all this is actually super hard to simulate as it turns out the reason

for that is that you have a lot of contacts going on so if you have

something in your hand like you kind of constantly touch it and contexts are

notoriously hard to model accurately first of all and then the hand itself is

also incredibly complicated so it’s 10 actuated which means that you kind of

have tendons pulling and just causes a lot of unmodeled kind of things in you

and your hardware that you have not modeled in simulation cool so as I

Cool. So as I mentioned, we set out to solve this problem with our sim-to-real approach: we train in simulation and then transfer to the physical hardware. While this sounds very easy, it is not, because the transfer problem, as you'll see, is actually not easy to overcome. But before we talk about that, let's have a look at what we can do in simulation and what the policy we train looks like there; I think this also illustrates the task at hand, so that you can understand later what the robot is trying to do. As you can see, you have this block with colored faces, and the task is to rotate the block into the desired target orientation, where the target is depicted as this semi-transparent additional block on the right-hand side. So now it's trying to bring up the blue face; yeah, it got it, and then it moves on to the next goal. As you can see, this involves coordinating its fingers, it has to use its palm, it's using gravity to let the block roll, and even in simulation this is not super easy to learn.

The hardware itself looks like this. This is the cage, as we call it; it houses all sorts of things. In the middle, of course, you have the Shadow Dexterous Hand, which is the robot itself, and it's surrounded by quite a lot of these PhaseSpace tracking cameras, 12 of them in total. What they do is provide relatively accurate sensing in Cartesian space: we have LED markers on the hand itself, so we know where the hand is, and we also have LED markers on the object, so we know where the object is. Those cameras sense the light of the LEDs, and since multiple cameras can see the same LED marker, they can do triangulation and recover the position in space from that information. We also have an alternative setup, because, as I mentioned, ultimately we'd like to have something that's more general, and having a motion capture system is not very real-world-like. So we also have RGB cameras, regular RGB cameras, three of them surrounding the scene, and they can also be used for sensing; in fact, they can be used for pose estimation of the object, so you don't even have to have any special sensing on the object itself, the cameras can do it for you. The reason we have three is, first, so they can recover depth information, and second, so they can work around occlusions: when the object is in the hand, from certain angles you sometimes cannot see it because it's covered by the hand. This is how it looks up close when we run things: you can see the hand with the block in its palm, and in this case it's the block we use for PhaseSpace tracking, so you can also see the LEDs on it; this is simply much easier to work with when testing these algorithms. So we have these two different setups.

All right, so the big question, of course, is how we do the transfer. I showed you a video of the policy doing its thing in simulation, and I showed you the physical hardware, so we have all the building blocks, but how can we actually transfer it to the physical robot? If you just train it in simulation, it will not work at all; that's the short version. I'll be showing some numbers for that as well, but you can believe me when I say the transfer problem is really the core issue we're dealing with here. The approach we're taking is relatively straightforward, actually: we use two main techniques, the first one of course being reinforcement learning to learn the actual control policy, and the second being domain randomization to make sure the learned control policy actually transfers to the physical system. I'll be speaking about both of those in a bit more detail.

Let's get started with domain randomization. This is a technique that has been used for a little while; a pretty popular paper when it comes to this is from 2016. In that paper they learned to fly a drone, and the way they approached it is that they trained only in simulation, using these randomized buildings. You can see there are a lot of different rooms, the textures are very different, so the walls, ceilings, and floors all look different, and they train a drone to fly in all of those rooms. What they then do is take this drone, which had only ever flown inside a simulation, and show that it can actually fly in another, completely different, real room, simply because it has seen all of this variety during its training. From the policy's perspective, what happens is that it thinks this is just another randomization: it's kind of weird, but oh well, I know how to handle it, so it flies in the actual room.

People at OpenAI have been using similar approaches as well. This is some work from my colleague Josh Tobin: he has been using domain randomization for grasping, with a robot called the Fetch. You'll see a better picture in a moment, but it's basically a simple robot arm with a parallel gripper at the end. What he would like to do is pick up these objects you see in these randomized scenes, basically using the same approach: he's randomizing all sorts of things, like the looks of the objects, the shapes of the objects, the background, the color of the table. As you can see, he can then use this training to transfer to the physical robot, even though it has never seen the actual physical table. What was pretty surprising in this research is that it turns out you don't even need photorealistic rendering: as you can see, this doesn't look realistic at all, it's pretty basic computer graphics, and still it transfers to the physical world. The important thing here is that you have this variety, not necessarily realistic environments.

Yes? So, are the two approaches I showed using vision to learn a policy? In this case I think it's actually not using vision to learn a policy directly; I think it's instead just predicting the location of the object, and then there's a policy that can grasp it from that.

Some other work in this domain, which I think is equally important, is physics randomization. This was done by Jason Peng, who was an intern at OpenAI in 2017. He's basically using the same idea of randomizing, but now for physics instead of visual appearances. It's kind of hard to visualize what's going on, but during training the policy sees worlds that are just different: maybe they have different masses, maybe the friction of the table is different, maybe the robot itself behaves differently, and so on. What he was able to show is that this, again, is sufficient to train strictly in simulation and then transfer to the physical robot. The task at hand here is again with the Fetch robot, and it's trying to move this puck to the goal location, which is marked in red. On the left-hand side you see a policy that was trained with those physics randomizations, and on the right-hand side one trained without. As you can see, the one on the left does a pretty decent job: it's relatively precise, it can push the puck where it wants it to go. The one on the right kind of freaks out; it shakes very violently, in fact the building was shaking when he was deploying this, and it cannot really do the job. The reason is that it has overfit to the simulation, which simply is not fully accurate even though it's calibrated to be close to the robot, so it doesn't generalize to the actual physical world, whereas the one with physics randomization does. Okay, so that's domain randomization in a nutshell: both the visual randomization and the physics randomization. Yeah?

yeah it’s it’s not very realistic honestly I mean it’s realistic in the

sense that it’s the physical so if you randomize too much your simulation will

become unstable because you’ve set in certain parameters such that they cannot

make sense anymore but it’s not very realistic like the

masses will be very high sometimes it’s like smart to move the puck and it’s

more about diversity again yeah okay cool

so I’ll now speak about our approach so what I previously talked about was

mostly other people’s work even though they’re also in the robotics team but

this is the the learning dexterity approach that we took so again remember

the goal is to have the shadow hand rotate an object in hand and to kind of

start it off I think it makes sense to just give you the the overview of the

entire system and then we’ll kind of dive into some of them details after

that so again as I mentioned everything we do is only in simulation so we never

see the actual physical robot until we run on it like we’ve never seen it so so

the way it works is that we collect a lot of data in simulations so we have

many many simulations running in parallel which is kind of depicted here

in box a and all of those are randomized which is kind of visualized by them

having different visual appearances but also think physics randomizations so the

friction and the masses will also be randomized and using this collected data

we basically end up training two different networks so one of them is a

policy and the other one is job is a vision network because we’d ultimately

like to run this from vision alone without the face base the policy network

is what is depicted in Box B here and the way it works is that it takes the

observed robot state which is the position of the five fingertips so you

have doting coaches in a space of 15 dimensions in total so it knows where

its fingertips are and then also the pose of the object so that means just at

the orientation and the rotation in space sorry the position and the

rotation in space and this information is then fed into an LSD and policy so

it’s a recurrent policy and it produces the next action and we train this in

simulation using reinforcement learning the second network that we have which is

actually distinct they are not end-to-end

this is two networks that we train separately it’s a vision Network and the

rate and vision Network uses works is that it takes three different images so

remember we had these three RGB cameras surrounding so images rendered from the

perspective of those but again only in simulation and then using a

convolutional neural network predicts the pose of the object from that

information from those images and again this is only trained in simulation when

it comes to actually deploying this to transfer as you can maybe kind of guess

is that we can combine those two systems to get us what we ultimately would like

so you use the actual cameras to sense the position or the pose of the object

using the vision network so you feed it into that and then by having the object

pose and the fingertip locations you use your L SCM policy to produce actions and

that allows the robot to basically see what is going on and react accordingly

and of all only being trained in simulation yeah potentially honestly we

Yeah, potentially. Honestly, we have mostly used this approach because we knew it worked from previous research. It is almost as accurate as PhaseSpace, and PhaseSpace is very, very accurate. I think if you spent a lot of time you could probably develop something with more traditional methods, I don't question that, but we would like to have something that's more general again, and having a convolutional neural network do it seemed like the most general approach we could take. Yeah?

it’s kind of interesting so ideally you would just use whatever the robot has as

joint sensing so it knows it should know what its own joints are as it turns out

the sensor in the shadow hand uses hall-effect sensing which is a magnetic

kind of sensor and they interfere quite a lot so if you think as a close

together you will actually not know where your fingers are so that’s the

reason why we don’t use it we would like to use it but it turned out to be not

precise enough for what we ultimately wanted to do so we couldn’t actually

rely but yeah you’re right like like this is

more for more for work around like ideally the robot should just tell us

what the joint positions are and then we wouldn’t need the fingertip positions no

it actually has very limited information it’s very surprising that it works like

Yeah, yes, very good question. There's a lot of debate about this. I don't think it does; we have some indication that it doesn't, and in fact it seems to help. The performance seems to improve across the board: we have certain ways of measuring sim-to-real transfer, and when we randomize more we tend to get better performance on all the environments. So I don't think it's compromising anything; I think it's more of an adaptive policy. But there are people who disagree, so it's currently a little bit unclear. Okay, cool.

So, as I mentioned, we need to randomize, and of course we use appearance randomization. This is only for the vision network, and it's basically what I've described before, just for our setup. You can see we have three different cameras showing the same scene, and we randomize this scene quite heavily: the robot changes its color, the background changes its color. Importantly, the block itself stays mostly the same, because it actually has those colors (you can't really re-dye it), but we change the material of the block as well, so it looks slightly different. And then we of course have the vision network, which again is relatively straightforward. The way it works is that it takes those three camera images, uses convolutions with a ResNet architecture and a spatial softmax to process them, and then produces the final object position and object rotation, so the pose of the object. This is simply trained with supervised learning, because in simulation you actually have perfect ground truth, which is another very convenient thing: you precisely know where your object is, and you don't have to actually sense it at all.

And this is what the model actually sees, which I think is very interesting, because the real scene looks very, very different from the randomized ones, and yet it generalizes to it, simply because it has seen enough variety that it's okay with yet another variety that's kind of weird but still within distribution, in that sense.
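To make the shape of this vision network concrete, here is a minimal sketch in PyTorch (not the team's actual code): a small convolutional encoder stands in for the ResNet trunk, a spatial softmax turns feature maps into expected 2-D keypoints, and a small head regresses position and rotation, trained with supervised learning against the simulator's ground-truth pose. All layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSoftmax(nn.Module):
    """Turn each feature map into the expected (x, y) location of its activation."""
    def forward(self, feats):                      # feats: (B, C, H, W)
        B, C, H, W = feats.shape
        probs = F.softmax(feats.view(B, C, H * W), dim=-1).view(B, C, H, W)
        ys = torch.linspace(-1, 1, H, device=feats.device).view(1, 1, H, 1)
        xs = torch.linspace(-1, 1, W, device=feats.device).view(1, 1, 1, W)
        ex = (probs * xs).sum(dim=(2, 3))          # expected x per channel
        ey = (probs * ys).sum(dim=(2, 3))          # expected y per channel
        return torch.cat([ex, ey], dim=1)          # (B, 2*C)

class PoseNet(nn.Module):
    """Three RGB camera views -> object position (3) + quaternion rotation (4)."""
    def __init__(self, n_cams=3, channels=32):
        super().__init__()
        # Illustrative small conv encoder; the real system used a ResNet trunk.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 5, stride=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2), nn.ReLU(),
        )
        self.spatial_softmax = SpatialSoftmax()
        self.head = nn.Sequential(
            nn.Linear(n_cams * 2 * channels, 128), nn.ReLU(),
            nn.Linear(128, 3 + 4),                 # xyz position + quaternion
        )

    def forward(self, images):                     # images: (B, n_cams, 3, H, W)
        feats = [self.spatial_softmax(self.encoder(images[:, i]))
                 for i in range(images.shape[1])]
        out = self.head(torch.cat(feats, dim=1))
        pos, quat = out[:, :3], F.normalize(out[:, 3:], dim=1)
        return pos, quat

# Supervised training against simulator ground truth (labels are free in sim), e.g.:
#   loss = F.mse_loss(pred_pos, true_pos) + some_rotation_loss(pred_quat, true_quat)
# where some_rotation_loss is a placeholder for whatever rotation loss you prefer.
```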

When it comes to the physics randomizations that we use, we randomize quite a lot of things as well. We have things like object dimensions, for instance, we have things like masses, obviously, and then mostly things about the robot itself: things like the way we actuate the robot, things like damping within its joints, and so on, and the reason for that is that these are actually very hard to measure. Another neat thing is that in this physics randomization you can actually account for your uncertainty. For the object dimensions we know those with relatively little uncertainty, because we can just measure the dimensions of the block, but things like the actuation we know much less about, and so we widen the randomizations for those. Another kind of cool thing is that we randomize the gravity vector, which may seem a little bit weird, but it basically amounts to this: when you mount the hand, it's not perfectly parallel to the floor, it will be slightly angled because of imperfections, and by randomizing the gravity vector you get this effect as well, that it's sometimes slightly angled. It turned out to be actually very useful. And then we of course also have noisy observations and noisy actions, simply because that's a reality of the physical system.

The policy itself is very simple. What it gets is the noisy observations, so that's the five fingertip positions, the pose of the object, and the goal, so it knows what it wants to do. Then we normalize a little bit, which is just making sure that things have zero mean and unit variance, then use one fully connected ReLU layer and one LSTM to produce the action distribution, and from that we sample and perform the action on the robot. So it's a relatively shallow and relatively small network overall.
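A minimal sketch of a policy with this shape follows; the hidden sizes and the diagonal-Gaussian action head are my assumptions, not the exact architecture.

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    """Observations -> normalize -> fully connected ReLU -> LSTM -> action distribution."""
    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        # Running mean/std for observation normalization (updated elsewhere).
        self.register_buffer("obs_mean", torch.zeros(obs_dim))
        self.register_buffer("obs_std", torch.ones(obs_dim))
        self.fc = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs_seq, state=None):
        # obs_seq: (batch, time, obs_dim) = fingertip positions + object pose + goal
        x = (obs_seq - self.obs_mean) / (self.obs_std + 1e-8)
        x, state = self.lstm(self.fc(x), state)
        dist = torch.distributions.Normal(self.mu(x), self.log_std.exp())
        return dist, state   # sample an action from dist; keep state for recurrence
```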

Yeah, they only come in through the simulation; they cannot be observed directly. Sorry: they are simply set in the simulation, so the environment has been changed, but the policy cannot sense this directly. It has to infer it, basically, because on the physical robot it also cannot sense it; we don't know what it is on the physical system. So what we think it ultimately ends up doing is some form of system identification: when it's running, it's implicitly inferring certain information about the environment and then using this information to adapt itself accordingly. Yeah? Sorry, I couldn't hear. Yeah, so we add Gaussian noise to the observations and to the actions. Alright, so I think I'm running a little bit late, actually. How bad is it?

Okay, then we have to hurry a little bit. Cool, so distributed training: let me speak about this and then I'll show a video. The distributed training I think is very interesting, because we use basically the same system that the Dota team uses as well. We have a very large-scale system, and the way it works is that we have rollout workers who generate a lot of experience, and then we have an optimizing machine that uses this information to update its policy. We use Proximal Policy Optimization for that, so an on-policy algorithm, as I think Josh explained earlier today, and I think it's kind of cool that we use the same system.
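In spirit, the pattern looks like the schematic sketch below, where every concrete piece (the trajectory collection, the PPO update, the weight broadcasting) is passed in as an assumed callable rather than real infrastructure code.

```python
def rollout_worker(env, get_latest_policy, collect_trajectory, experience_queue):
    """Each worker runs its own (randomized) simulation forever and streams
    trajectories into a shared queue, always acting with the freshest policy."""
    while True:
        policy = get_latest_policy()                      # fetch current weights
        experience_queue.put(collect_trajectory(env, policy))

def optimizer_loop(experience_queue, policy, ppo_update, publish_weights,
                   batch_size=64):
    """A single optimizer pulls batches of experience, applies PPO updates,
    then broadcasts the new weights back out to the rollout workers."""
    while True:
        batch = [experience_queue.get() for _ in range(batch_size)]
        ppo_update(policy, batch)                         # clipped-surrogate update
        publish_weights(policy)

# Usage sketch: start many rollout_worker processes (e.g. multiprocessing.Process,
# or separate machines) sharing one queue, plus one optimizer_loop process.
```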

I think I’ve want to show this so this is when it’s running on the physical

robot as you can see it’s using vision so there are no markers on the actual

object the robot hand is doing all of this this is not cut in any way it is

not sped up again the goal is depicted in the right corner here so it will try

to get the e face front and the end face up top and it will get 250 successful

rotations in this case so it can do quite a lot of those and it can run on

the on the physical system and if I have enough time one one kind of final thing

If I have enough time, one final thing that I think is actually very interesting is that it learns certain strategies that happen to have names. We have things like finger pivoting, where you use two fingers to create a rotational axis and then rotate around that, and things like finger gaiting. The reason they have names is that they are used by humans as well and have been studied very well, and they emerge automatically in our case. We have never shown the robot what a human would do; it discovered this itself. The reason they come up is simply that it has a human-like morphology, right, it has a human-like hand, and it just turns out that these strategies are equally useful for humans and robots, but they have been 'rediscovered', quote-unquote, which I think is a really neat thing.

So I wanted to mention that. And then we have some results that show that the randomizations are very important: if you don't randomize, you get no successes; if you randomize, you do. It turns out memory is very important, so you need an LSTM, you cannot simply have a feedforward policy, and you need a lot of experience: for the final policy we use a hundred years' worth of data. Imagine doing that on the physical robot; probably not such a good idea. But we can get away with it because we use simulation, so we do all of this in 50 hours. And I think with that I have to close. Alright, thank you.

Great, thank you so much, Matthias. We're going to switch out the slides, and then please welcome to the stage the leader of the safety team at OpenAI, Dario Amodei.

Alright, just a minute to get the slides right.

It's a very good thing that you're ensuring that computers in the future will not be as malicious. So, I work on a team at OpenAI that thinks about making AI systems do what humans want them to do, which is very central to OpenAI's mission, and which we think of as something that our focus on distinguishes us from other organizations. We think it's very important, particularly as systems get more capable, to ensure that they benefit society in both a narrow and a broad sense.

in deep RL so it’s useful to step back and you know think about what is what is

RL accomplished in the last couple years and where is it going so you know this

is actually out of date we should add add a couple things to it but you know

if we look at playing games like go if we look at for about a year ago multi

agent behaviors where you can use RL and self play to train agents to sumo

wrestle each other off a pad we are able to play competitively against

professional professional players in dota 2 the robot results which you just

saw and you know we should probably add just in the last week or two you know

the results we’ve seen on StarCraft which is you know in some ways similar

to dota but just a different kind of game with the different kind of

properties and yet you know that shows that these techniques are really are

really pretty general and are are advancing pretty quickly so you know if

we step back and reflect on you know kind of where are things going you know

some properties that we could point out of these RL agents that are becoming

more and more true right that we’re not true five years ago but are becoming

more and are true we have systems that have an

We have systems that have extended interaction with complex, real-time environments; they have a very high level of autonomy and speed. You can imagine systems like this in the real world being used to make decisions faster than humans can intervene, or in more complex ways than humans could hope to understand: regulating the economy or the financial system, managing large networks of computers. These are the kinds of things that, as RL technology matures, it will be better and better able to do. And these systems, unlike supervised learning systems, and unlike, in any interesting way, the simple RL systems of a few years ago, are able to teach themselves and discover their own strategies, and in many cases they discover non-trivial strategies. Just like we saw with the robot recapitulating a lot of strategies that humans use, we see in Go and Dota and StarCraft a lot of human strategies that have names, which the RL system discovers and recapitulates; but it also sometimes discovers strategies that a human would never have thought of.

strategies that a human would never would never have thought of so if we

look at what these properties mean together one thing it means is that the

connection between us as designers specifying what we want the system to do

and what the system actually does in theory the system does in theory if

everything is done right the system does what we want but that that rope it’s

longer it’s more afraid it’s more tenuous

than for just kind of less less autonomous systems that we’ve we’ve

designed in the past and there are many ways relative to you

know simple computer systems or machine learning systems like supervised

learning for for these systems to go wrong and so a couple years ago several

people on most most of whom are now now now constitute the the open a nice

safety team started started thinking about this you know we’re worried about

current systems worried about tomorrow systems eventually we’re worried about

you about about building general

intelligence and what that what that will mean for the world and making sure

that those systems are safe so you know we wrote kind of a position paper and

this kind of started us thinking about you know the directions and how to even

think about this problem of you know do systems reliably do do what we want them

to do and the the kind of general framework and division we came up with

The general framework and division we came up with was this. Okay, let's narrowly scope the problem: we're not speaking about wider societal impacts, although those are also important, but just narrowly, the designer had a clear thing they wanted the system to do, then the system gets trained, it gets deployed, it goes through some long process, and the actual system fails at this catastrophically. We divide that up into a couple of things. One is that you're giving the system some direction, some objective function that it learns from, like the reward in RL; there are ways for that to be subtly wrong, and you can get spectacularly wrong behavior if that happens. Or you might have the right objective function, but your system has problems with robustness: it doesn't generalize well, it exhibits unpredictable behavior as it's learning, it does dangerous things even if the final policy it's going to learn makes sense. And then, as a reminder, all of this exists on top of a software implementation that has bugs in and of itself; so A and B are new, but they're layered on top of the general unreliability of software. A useful way to think about it, setting C aside because it's not really a machine learning problem, just a reminder that this is layered on top of existing problems, is a crude analogy to the simple statistical concepts of bias and variance. A better objective function is about reducing bias and making sure you aim at the right target; robustness is about making sure that you cluster tightly around the target and that you always get what you're intending to get. So we're interested in both problems.

Because I have limited time, I'm going to talk about our work on getting the objective function right. I think the OpenAI safety team does more of that relative to other teams that think about these problems, say at Google Brain or DeepMind, so I'll mostly talk about that; but increasingly, and maybe I'll have a little bit of time to talk about it at the end, we're also thinking about the robustness direction and how these two things interact.

Just to be clear about what we mean: this video has been widely circulated, so I apologize to people who are already familiar with it, but about a year and a half ago we were training lots of Flash games using RL, and there happens to be this boat-race game. I just set lots of games running with a reward function. The way this boat race works is that you're supposed to go along the course and finish it, but the way the reward function works (and it's hard to reach in and write a different reward function) is that you get points for these markers along the way, which are mostly along the course. It turns out there's this little lagoon in the corner of the course where you can go around in circles and collect more and more power-ups, and that turns out to give you points at a faster rate than actually finishing the course. There's nothing wrong with RL here; the system did what it was supposed to do. But it illustrates the weakness of the connection between a reward function and the final behavior: the reward function that you specify, which you may think corresponds to some behavior that you want, may in fact correspond to very different behavior, and you get no feedback on that other than just finding out what the system does. When I first trained this, I trained it along with a bunch of other games, and two days later I looked at it and thought, what in the world is this doing, it doesn't make any sense; and then I thought about it and realized, oh, of course that makes sense. So the more powerful the system is, the more autonomous it is, and the less a human is paying attention to it, the more potential there is for this kind of thing. You could generate dozens of these examples, but here's another:

a robotic system where we forgot to make the table totally fixed. It has a high mass, but it's not fixed. It turns out it's hard to send the puck exactly to the point where you want it to be; it's easier to send the puck, observe whether it's going to land a little to the right or a little to the left, and then nudge the table so that it hits the target exactly. It's very clever, and it's a correct solution to the problem, but the problem was not the right problem.

So the general approach that we've hit on, and we've been pursuing this strategy for about a year, a year and a half, is that this training loop is too long. The human at the beginning says, here's a mathematical reward function, go optimize this; then you look back at the end of training, and you might get the right thing, you might not. If you don't, you have to go back to the beginning, or maybe the system is already doing something dangerous. So maybe we should have humans be involved interactively in the training process. When we train humans to do things, it's not just 'here's your goal, go off and tell me what you did two weeks later.' So if we do this, is there a way that we can use a human to decide what the reward function is in a continuous way that's more reliable and more naturalistic, so that the system ends up imbued with human goals and values, but is able to act faster and at a bigger scale than humans once it's trained, because it knows what the human wants and it does it? One version of this is that instead of RL we can learn from demonstrations, but that kind of has the same problem: a human demonstrates, the AI system copies it, and it's hard to do better than the human, it's hard to course-correct, it's hard for the human to say you should be doing this instead of that. And traditional RL has a loop that's too long.

So the first effort we did in this direction is what we called deep RL from human preferences. The idea is: I want this thing to do a backflip, and it's hard to mathematically specify the reward function for a backflip. We tried, by looking at all the individual joint angles, and it turns out it just gives you something very awkward-looking. What we do instead (and this is now running for the second time) is that a human looks at the behavior of the system and says which of these two clips is more like a backflip than the other. The system just starts by acting randomly; it has just a random reward function, and the human gives it feedback on what is more like what the human wants. Then the RL system has a reward predictor, and it tries to fit a reward predictor consistent with what the human says they prefer, and in the background it's running a whole bunch of copies of the RL environment, and those copies optimize the reward function that it learns from the human. The human only ever has to give feedback on a very small fraction of the AI system's behavior; they don't have to see everything it does, they just have to give enough samples to give the policy an idea of what the reward function should be. So another way to put it is: the human trains the reward function, and the reward function trains the RL system.

What I just said can be pictured like this. The grey part is the standard setup for reinforcement learning, where you have an RL algorithm and the environment, they exchange observations and actions, and there's a reward that kind of comes from the ether: it was ultimately specified by a designer, but it isn't thought of as being part of the problem. Here, what we have instead is that the reward starts out being completely random, and the human sees examples of the agent's behavior and feeds them to a reward predictor. So the reward predictor is changing and improving and adapting over time, and the RL system is both learning from the existing reward function and adapting to changes in the reward function.
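A minimal sketch of that learning signal follows; it is my own simplification with hypothetical shapes, not the paper's code. The reward predictor is trained so that the clip the human preferred gets the higher predicted total reward, via a Bradley-Terry-style logistic loss, and the RL algorithm then simply uses the predictor's output in place of an environment reward.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardPredictor(nn.Module):
    """Maps a single (observation, action) pair to a scalar reward estimate."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def preference_loss(reward_model, clip_a, clip_b, human_prefers_a):
    """Bradley-Terry loss: the preferred clip should have higher summed reward.
    clip_a / clip_b are (obs, act) tensor pairs of shape (T, obs_dim) / (T, act_dim);
    human_prefers_a is 1.0 if the human chose clip A, 0.0 for clip B,
    and 0.5 when they marked the two clips as about the same."""
    r_a = reward_model(*clip_a).sum()
    r_b = reward_model(*clip_b).sum()
    logit = r_a - r_b                       # log-odds that A is preferred
    target = torch.as_tensor(human_prefers_a, dtype=logit.dtype)
    return F.binary_cross_entropy_with_logits(logit, target)

# During RL training, the environment reward is simply replaced by
# reward_model(obs, act), which keeps improving as more comparisons arrive.
```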

We did several versions of this in our paper, and we found that a simple active-learning technique helped relative to random selection; it didn't help by that much, but it helped. The idea is that you train an ensemble of reward predictors on subsets of the data, which gives you semi-independent predictors, and you can pick examples where the predictors are uncertain, meaning those are the parts of the space, or the situations, where the reward predictor has more uncertainty and so would benefit from more feedback from the human. That helps.
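A hedged sketch of that selection rule (the helper names are mine): train several reward predictors on different subsets of the comparison data, then query the human on the clip pairs where the ensemble members disagree the most about which clip would be preferred.

```python
import torch

def disagreement_score(ensemble, clip_a, clip_b):
    """Variance across ensemble members of the predicted preference for clip A.
    High variance means the predictors disagree, so the human's label is most useful."""
    probs = []
    with torch.no_grad():
        for reward_model in ensemble:                 # each trained on its own data subset
            r_a = reward_model(*clip_a).sum()
            r_b = reward_model(*clip_b).sum()
            probs.append(torch.sigmoid(r_a - r_b))    # P(human prefers A) per member
    return torch.stack(probs).var()

def select_queries(ensemble, candidate_pairs, n_queries):
    """Pick the n_queries candidate clip pairs with the highest disagreement."""
    scored = sorted(candidate_pairs,
                    key=lambda pair: disagreement_score(ensemble, *pair),
                    reverse=True)
    return scored[:n_queries]
```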

You can go much more sophisticated in that direction: the system could ask the human, 'what am I doing that's wrong, what am I doing that's not clear,' and the human could say to the system, 'I'd like you to produce some examples of this,' and then it becomes much more like a human-teacher-to-pupil teaching process. A lot of what we're doing is going in that direction, but we had to start somewhere.

Imitation learning has the following limitation: when you do imitation learning, except for noise reduction, which is usually a small effect, you can't perform better than the human does. As we'll see in some of the tasks later, there are cases where learning from preferences allows you to perform better than the human does. The reason is that with imitation learning you just do what the human does, whereas here you learn what the human wants, and once you've learned the reward function you can do it better than the human. Consider something like this: even if I don't know how to play Go well, I can teach you the rules of Go, and then you can do RL on the rules of Go and get much better than me, or you can just copy my moves. If you're just copying my moves, you can never do better than me; if I teach you the rules and then you use RL to learn how to play, you can then, in principle, do better.

Another difference is that you tend to get better sample efficiency, you can come up with strategies that a human wouldn't have thought of, and there are many tasks a human just can't do. Actually, this backflip task is very hard for a human to demonstrate; you'd have to get a VR setup. And if we look at the tasks of the future, say I want to defend a large corporate IT network or something and I want to respond to threats in real time, that's just something where I can't get training data from a human. I'm asking the machine to do things that a human can't do, which is what we ultimately want AI systems to be able to do. Does that kind of answer the question?

don’t know or I think we had separate options for I don’t know or it just

throws out the data or these two look about the same in which case it like

waits them equally in the predictor and in yeah so that’s that’s easy to

incorporate I think ultimately the communication needs to be in terms of

language and not in terms of clicking left or right and then that will kind of

like make a richer space for doing things and saying I don’t know or like

show me some other examples these things aren’t comparable at all become much

more common so the nice thing about this is given an environment without changing

So the nice thing about this is that, given an environment, without changing the code at all, only changing what the human provides as feedback, you can get totally different behaviors. In about half an hour, a human can train this RL system on this simple Atari Enduro game. I can train it to do the usual thing, which is to race ahead of all the other cars, but I can also train it to go at exactly the same speed as the other cars. And when it does that, it's able to do it very effectively, staying exactly even with the other cars, which isn't easy; you have to match their speed exactly. And it's the exact same code; the human just provided different feedback.

we show is if we don’t give you the rewards for Atari games we just hide

this hide them from you humans giving feedback on basically you know trying to

get the system to get the highest score that it can works really well on the

kind of right of each panel those like colored bars that are moving that

represents how much reward the system is thinks that it’s getting or just how

much how much it thinks to give an action is good so if you look at the

breakout case when the ball hits the paddle instead of so on the Left when

the ball hits the paddle instead of you know instant instead of instead of the

ball going to the bottom it says yeah I got a lot of reward from that same with

pong the when it surfaces to get oxygen and Seaquest it’s very very very high

very high reward level so the predictors seem to correspond to what you know to

what human would say is good behavior which is not surprising because of human

training them so we did did a bunch of we did a bunch of experiments and you

We did a bunch of experiments. With fixed-reward Atari games, your goal is just to do as well as you would if you knew the reward; you're hiding the reward from yourself and trying to learn it from a human. Most of the time it does almost as well, but there are actually cases where it can do better. In Enduro, the algorithm we used, A3C, has trouble learning because of the sparsity of the reward, but a human actually helps to shape the reward. In Enduro you have to get up to a certain speed in order to get any reward at all, so you can start to move and the game doesn't give you any reward, and then you have to keep moving faster and faster to get reward, and some algorithms never figure that out. But the human will basically say, okay, you went ahead, you made progress, that's better than when you're not moving, and so little by little, with just a few feedback points, the human can lead the system; the human can shape the reward. And there are actually cases, like the curve for Enduro in the bottom right, where you can do better than a standard RL algorithm did even though you had less information: instead of knowing the right reward function, you just had a human indicating the reward function. It also works for a bunch of simulated robotics tasks; we haven't really tried it in the real world.

tasks we haven’t really tried it in the real world relevant to the question

about demonstrations we’ve we actually followed this up with an effort

combining human feedback with demonstrations so what that did is you

know there’s some tasks human can do it but we liken it you we’d

like the RL system to do it better however we can initialize from human

human demonstrations the AI system copies that but then on top of that on

top of that initialization we run RL RL with human preferences so there’s no

reward function there’s no like programmatic reward function anywhere

it’s entirely learning from humans but the first step the human demonstrates

and then the second step the AI system copies in the human says it would be

better if you could do it this way and again the second step allows you to

exceed human performance or do tasks that humans can’t do right the humans

like this is as well as I know how to do it the AI system copies that the the

human says you know ok well I wouldn’t be able to do this myself but if you

move back and forth really quickly and shot those two ships that will be better

than if you didn’t do that the AI system is capable of that and so it can

bootstrap itself to kind of beyond beyond human capabilities more recently
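As a sketch of that two-stage pipeline, with every helper passed in as a hypothetical callable since the real training stack is much more involved, it amounts to behavioral cloning on the demonstrations first, then preference-based RL fine-tuning on top:

```python
def train_with_demos_and_preferences(policy, reward_model, demos, env,
                                     bc_update, collect_comparisons,
                                     fit_reward_model, rl_update, n_iters=1000):
    """Step 1: imitate the human. Step 2: improve beyond the human using
    preference feedback instead of a programmatic reward function."""
    # 1) Behavioral cloning: supervised learning on (obs, action) pairs.
    for obs, act in demos:
        bc_update(policy, obs, act)

    # 2) Preference-based RL fine-tuning on top of the cloned policy.
    for _ in range(n_iters):
        comparisons = collect_comparisons(policy, env)   # human labels a few clip pairs
        fit_reward_model(reward_model, comparisons)      # e.g. Bradley-Terry loss as above
        rl_update(policy, env, reward_model)             # e.g. PPO on the learned reward
    return policy
```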

and we don’t have any work out on this but I think we will soon we’ve started

applying this to natural language so in the last year or so there have been kind

of big a lot of progress on large language models like open the eyes GPT

and google’s burt where you just take a big corpus of text you train just just a

big transformer model to to predict the next word or the next token and that

allows you to generate very coherent text and can also be fine-tuned to solve

a lot of linguistic tasks so one one idea there is can we find two NAT via RL

from human preferences right I have a language model it’s

a lot of text some of its happy some of its sad five five minutes yeah you know

some of its formal statements or informal statements some of its jokes

the language model maybe has some idea in its internal representation of the

difference between those things but you know if I just sample from the language

model it just kind of gives me random samples of stuff so can i push this

language model in directions and to produce behaviors that only a human can

specify that can’t be specified programmatically so things like

statements that rhyme or our-our statements that are in iambic pentameter

could you make a system that is you know from the logic of learning from human

preferences is a better poet than any human could be or something like this or

you know makes makes like very positive sentiment statements that are you know

that it’s hard to find enough enough positive sentiment statements to to copy

from so that’s the direction we’re going in and then I think you know like a long

And then I think a long-term vision for it would be a system that basically has an ongoing dialogue with a human: the human asks it to do something really complicated, like planning and executing a mission to Mars, and the system clarifies, asks for instructions while it's learning and while it's doing the task, and we make sure that pathological solutions to the problem don't happen. One way to get to Mars really quickly is to escape from Earth and propel yourself by dropping a bunch of nuclear explosions back at Earth. That would work, that would get you to Mars; there was a project called Project Orion in the 1950s, although the plan there was to detonate the nuclear devices when they were far away from Earth. But this is not a solution we would favor. How do we make sure that AI systems don't do things like that?

Cool. So I've only talked about a subset of what the safety team is working on, but we have around 15 members here. Some of these efforts were done in collaboration with DeepMind and various academic groups, and we have a number of interns and faculty affiliates. The safety team is continuing to hire, and we're interested in further advancing these and other areas. Thank you so much.

Sorry, oh, hello everyone! We are now at the conclusion of today's morning talks, but before we break for lunch I would like to invite all of the volunteers who are joining us today from OpenAI and from Berkeley and New Haven schools to please come up to the front. As we proceed into the afternoon hackathon and breakout sessions, these will be the faces that will be around to help you, the people you should ask questions to. These people are all talented researchers or contributors or engineers in this space; many of them are employees of OpenAI, and we also have, I think, the only person here who's not currently employed by OpenAI but was previously employed by OpenAI. So if you want to pick our brains about what it's like here, what we do, and why it matters, please feel free. Can we just have everyone take maybe a second to introduce themselves?

Sure. I'm Daniel, I work on the safety team as an ML engineer, working on the language fine-tuning from human feedback project. Yeah, I'm Matthias, I'm on robotics. I'm Ethan, I'm on the safety team, working on model-based RL and safe exploration with Josh. I'm Karl, I'm on the games team, primarily studying transfer learning and procedurally generated environments. My name is Dylan, I'm a PhD student at UC Berkeley, and I mainly work on preference learning. I'm Amanda, and I'm on the policy team here at OpenAI. I'm Mary, I work on the safety team on safe exploration.

Alright. Another thing that I want to say: thank you all so much for being here today. Something that I hope we can do is really make this a useful experience for all of you, and I hope that over the course of the day you give us feedback about what you find helpful and not helpful, and what it is that you're hoping to get out of this experience, so that we can figure out how to help you get there. Thank you so much, and please enjoy lunch!