Investigating Model Based RL for Continuous Control | Alex Botev | 2018 Summer Intern Open House | OpenAI



Transcript

Hi everyone, I'm Alexander Botev. I'm currently a PhD student at University College London, and I'm doing my internship in the robotics team, investigating model-based reinforcement learning for continuous control. I'm going to chime in on some of the things that Sally mentioned earlier about model-based control, so let's dive in. This is the outline of the talk. First, I'm going to introduce why we care about model-based RL. I have a small intro which I guess most of you will be familiar with: in reinforcement learning we have an agent that interacts with an environment, which is different from standard supervised or unsupervised learning settings. The agent interacts through actions, and the environment gives back observations and a reward signal. The agent tries to maximize the cumulative sum of rewards over the episodes that it experiences.

Now, the main difference between a model-free algorithm and a model-based algorithm is that model-free algorithms try to learn a policy π and, most often, either a value function or a state-action value function, which estimate how good particular states of the environment are for the agent to be in. This is done solely through the reward signal and experience with the environment. In model-based RL, we additionally try to learn an internal dynamics model, which essentially models how the environment evolves and tries to match what the agent experiences in the environment. In general, we assume that we don't have access to how the environment works except through sampling.
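To make this concrete, here is a minimal sketch of what such a learned dynamics model could look like, assuming a simple feed-forward network fit with a mean-squared-error loss on observed (state, action, next state) transitions. The architecture, sizes and names are illustrative assumptions, not the exact model from the talk:

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """Predicts the next state from the current state and action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        # Predicting the change in state and adding it back is a common
        # choice (an assumption here, not necessarily what the talk used).
        return state + self.net(torch.cat([state, action], dim=-1))

def dynamics_train_step(model, optimizer, states, actions, next_states):
    """One supervised update on a batch of transitions from the environment."""
    pred = model(states, actions)
    loss = nn.functional.mse_loss(pred, next_states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The training data for this regression is simply the transitions the agent collects while interacting, which is also what makes the model independent of any particular reward.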

So what are the potential benefits of using an internal model at all? To some extent, the holy grail, if you had a perfect and ideal model, is that you could solve any task without ever interacting with the environment: with a perfect model you can simulate it as much as you want without touching the true environment, and you can solve any task. Solving the task inside the model might still be hard, but that's a concern we won't consider too much here.

However, in practice we have to learn this model. Still, there are several benefits. First, the model is task independent: all of the policies and value functions that a model-free algorithm estimates are tied to specific rewards, so if you slightly change the task you usually have to retrain from scratch. If you have a model that was learned very accurately on one task and you change the reward, you can retrain your model-free algorithm, or do any other kind of search on top of the model, without interacting with the environment. Second, the model is usually trained in a supervised way, sometimes unsupervised, and the benefit of this is that it's much easier and more stable to train, as these techniques don't rely on things like bootstrapping, which makes training more unstable. Importantly, we can also plan with much more sophisticated algorithms: with a good model you can do trajectory optimization in robotics, or tree search as in Go and chess. And one argument that's very often mentioned is that you make better use of your data: rather than using just a single scalar reward signal, by trying to capture all the variety of the environment you are, in a sense, learning faster by using more information from the environment.

So why is model-based learning hard? Currently, most model-free algorithms are actually more sample efficient than model-based algorithms and have better asymptotic behavior. Here I'm going to show you two videos of basically the same environment. It's very simple: you have a ball that you control with torque. In red you're going to see what the real environment looks like, and in green the prediction of a model. On the left-hand side we measure just one-step prediction: at every state the model is given what is in the environment, predicts the next state, and we visualize that. Oops, can we play the left one? As you can see, the one-step prediction model is almost indistinguishable from the real environment; the error is literally very tiny, so things seem to work. Okay, so maybe we've solved model-based RL. But if we now start from a state and unroll the environment and the model totally independently of each other, we get this kind of behavior, and you can see how far off the model goes, even though this is exactly the same model. And if you then try to train on this kind of model by just unrolling it, the policies that you get are very suboptimal and sometimes don't even work at all in the real environment.
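To make the difference between the two videos concrete, here is a small sketch of the two evaluation modes, reusing the illustrative `DynamicsModel` above and a recorded real trajectory (its states and the actions that were applied). This is only a sketch of the idea, not the actual evaluation code:

```python
import torch

@torch.no_grad()
def grounded_one_step(model, real_states, actions):
    """Left video: at every step the model is fed the real state and
    only ever predicts one step ahead."""
    return [model(s, a) for s, a in zip(real_states[:-1], actions)]

@torch.no_grad()
def autoregressive_rollout(model, initial_state, actions):
    """Right video: the model is unrolled on its own predictions and never
    grounded in the real environment again, so errors compound over time."""
    s, preds = initial_state, []
    for a in actions:            # here we simply replay the recorded actions
        s = model(s, a)
        preds.append(s)
    return preds
```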

So what are the difficulties of training a dynamics model? First, not all aspects of the environment might be relevant for the task you care about. For instance, if you're a household robot, what's playing on the TV will probably never be helpful, but it might be very complicated to model, so you might waste a lot of model capacity on that kind of detail. Second, and this is probably the biggest issue, what I showed earlier: compounding errors lead to very bad predictions over long horizons. One of the main issues currently with model-based RL is that if you try to unroll the models over longer horizons, they drift so far away from the truth that training on them doesn't work, even though long horizons are important for planning towards long-horizon goals. Third, it's very hard to estimate uncertainty for flexible models; this is a general problem with neural networks in machine learning, and we usually want to use neural networks because the dynamics are actually very complicated. Finally, in terms of sample efficiency: model-based RL might sound like it should be more sample efficient, but in practice, learning a model accurate enough to do anything useful will probably require a lot of data. To some extent there is a trade-off here: if getting an accurate dynamics model requires more data than your model-free algorithm needs to learn a policy, then a model-based approach will never be more sample efficient than a model-free one. There's a sort of grey area between whether it's harder to learn the dynamics, or whether it's so easy to learn the policy that you don't need a model at all.

What I was investigating during my internship, and my main area of research, was an idea called value expansion, mainly for actor-critic architectures. The main idea of value expansion is to use the model in order to improve a model-free algorithm in terms of its sample efficiency and stability. In standard actor-critic, we have a state and action coming from the environment, and then a next state sampled from the environment, and we regress our action-value function towards a target which is usually bootstrapped. What value expansion does is, after we learn a model (denoted here by χ), for every piece of offline data that we have collected we can use the model to unroll multiple steps. This way we get on-policy targets, so we don't need any corrections like the ones you'd usually need with importance sampling, and we get multi-step-horizon targets by unrolling the model. This of course relies on the model being reasonably accurate to work, but it essentially gives us much more stable targets, and potentially much better ones by having multiple of them.
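Here is a minimal sketch of how such a model-based value-expansion target for the critic could be computed, reusing the illustrative `DynamicsModel` above and assuming a policy `pi`, a target critic `q_target`, and a reward function `reward_fn` (either known from the task or learned alongside the model). These names, and the horizon, are assumptions for illustration rather than the exact implementation from the talk:

```python
import torch

@torch.no_grad()
def value_expansion_target(model, reward_fn, pi, q_target,
                           state, action, gamma=0.99, horizon=3):
    """H-step target for Q(state, action):
        r_0 + gamma*r_1 + ... + gamma^(H-1)*r_{H-1} + gamma^H * Q_target(s_H, pi(s_H)),
    where the intermediate states come from unrolling the model."""
    target = torch.zeros(state.shape[0])   # assumes batched 2-D tensors
    discount = 1.0
    s, a = state, action
    for _ in range(horizon):
        next_s = model(s, a)
        target = target + discount * reward_fn(s, a, next_s)
        discount = discount * gamma
        s = next_s
        a = pi(s)                          # on-policy actions inside the model
    # bootstrap with the target critic at the end of the imagined rollout
    return target + discount * q_target(s, a)
```

The critic is then regressed towards this target in place of the usual one-step bootstrapped target.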

Now I'm going to show you some results and some conclusions from my experience trying all of this in robotics. The first environments I evaluated on were the standard Fetch tasks. These are environments in a simulator where you have a robot arm with seven degrees of freedom and a gripper. In one environment you just move the arm to a specific location; in pick-and-place you have to place a block at a specific location; in the push environment you push a block on a table to a location that changes every episode; and in slide, which is usually quite difficult, you have a puck that slides on a table and you have to push it until it gets to the right place, but you must be careful not to overshoot.

Here I'm just showing two plots from some of the work that I've done; there's a lot more, but I don't want to bore you too much. The black lines are a baseline deterministic policy gradient algorithm with hindsight experience replay and double Q-learning, specifically optimized to be as sample efficient as possible; there was a big hyperparameter search that I did. The red and the blue curves are model-based approaches with value expansion, and as you can see they outperform the baseline, in some cases more significantly than in others. They are a bit fragile with respect to some of the hyperparameters, but all of the final results were achieved with the same hyperparameters across the tasks. This showed that the algorithm can actually help, sometimes with up to five times better sample efficiency.

So, some takeaways from my experiments which I think are quite important. First, using an ensemble for the dynamics model was always necessary: a single model never worked and was never able to beat the baseline, and I think this echoes a recurring theme in the community recently.
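A small sketch of what such an ensemble could look like, reusing the illustrative `DynamicsModel` above. Averaging the members' predictions is just one way of combining them; the talk doesn't specify how the ensemble was used, so treat this as an assumption:

```python
import torch

class DynamicsEnsemble:
    """Several independently initialized dynamics models; the spread of
    their predictions gives a rough measure of model uncertainty."""
    def __init__(self, state_dim, action_dim, n_models=5):
        self.members = [DynamicsModel(state_dim, action_dim)
                        for _ in range(n_models)]

    @torch.no_grad()
    def predict(self, state, action):
        preds = torch.stack([m(state, action) for m in self.members])
        return preds.mean(dim=0), preds.std(dim=0)  # mean prediction and spread
```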

Second, training the dynamics models on multi-step losses, essentially trying to make them more consistent as you feed their own predictions back into them, was also necessary for improving on the baseline. Otherwise the models usually converged to the same value, but you don't get an improvement in sample efficiency.
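A sketch of such a multi-step loss, where the model is fed its own predictions for several steps but every intermediate prediction is still matched against the real trajectory. The horizon and the uniform weighting are assumptions; the Q&A below mentions five-to-eight-step horizons working well:

```python
import torch.nn.functional as F

def multi_step_loss(model, states, actions, horizon=5):
    """states: [T+1, batch, state_dim], actions: [T, batch, action_dim],
    a segment of a real trajectory with T >= horizon."""
    s = states[0]
    loss = 0.0
    for t in range(horizon):
        s = model(s, actions[t])                    # feed the model its own prediction
        loss = loss + F.mse_loss(s, states[t + 1])  # but ground every step in real data
    return loss / horizon
```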

Third, for the value expansion itself, expanding for more than one or two steps was always needed in order to get an actual benefit from the method.

One thing which was quite interesting is that being pessimistic seems to help: when you have the multiple-horizon targets, rather than taking an average or an exponentially weighted average like TD(λ), taking a minimum over the different horizons seems to do better in the harder environments. This is similar in spirit to double Q-learning, where you're essentially trying to kill off some of the overestimation bias of your Q-function.
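With the `value_expansion_target` sketch above, this pessimistic aggregation could look roughly like the following (again an illustrative sketch rather than the exact implementation):

```python
import torch

def pessimistic_target(model, reward_fn, pi, q_target, state, action,
                       gamma=0.99, max_horizon=4):
    """Compute targets for several rollout horizons and keep the smallest,
    instead of averaging them TD(lambda)-style."""
    targets = torch.stack([
        value_expansion_target(model, reward_fn, pi, q_target,
                               state, action, gamma, horizon=h)
        for h in range(1, max_horizon + 1)
    ])
    return targets.min(dim=0).values
```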

I'm going to skip this because I don't have time, so thank you very much for your time.

[Applause]

[Music]

So, the video that I showed was with a model trained on the dynamics, and in both cases it's trained on the same environment; the difference is in how you generate the video. In one, you start with a state of the environment, unroll the model once, predict the next state and visualize that, but after that you feed whatever happened in the real environment back into the model. In the second video that I showed, where you get this huge divergence, you start from a state and then unroll the two things totally independently, so the model at every step takes its own prediction from the previous time step and never gets grounded again in the true environment. Does that make sense? Okay, yeah, I can explain that later.

[Laughter]

So, do you mean during training or during test time? So yes, one of the points about the multi-step loss was that you can do a multi-step loss where you feed the prediction of the model into itself multiple times and then ground it to what you saw in the real world, but you can do that only with deterministic models. In my experience, with that kind of loss over longer horizons, where I usually needed to use maybe five-to-eight-step horizon losses, the deterministic models did much better than the stochastic ones, which you can only train with a single-step loss.

So, basically it was about 50/50. On one of the environments the asymptotic behavior was actually even slightly better than the baseline, which probably suggests that the model has learned the environment very accurately and you're gaining something more by having multiple-horizon targets. In two of the other environments they converge to pretty much the same thing, and in one of them the model-based approach is actually a bit worse in asymptotic behavior. In general, you can always do the value expansion and also interpolate between the value-expansion target and the real-world target, which is only one step, and that, for instance, can alleviate that problem. Okay, thanks for the questions.
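As a final sketch, that interpolation can be as simple as a convex combination of the model-based target and the ordinary one-step target computed from real data; the mixing weight here is an arbitrary illustrative value, not one from the talk:

```python
def blended_target(mve_target, one_step_target, alpha=0.5):
    """Mix the value-expansion target with the standard one-step
    bootstrapped target from real transitions."""
    return alpha * mve_target + (1.0 - alpha) * one_step_target
```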

[Applause]