Learning Multiple Modes of Behavior in a Continuous… | Tyna Eloundou | OpenAI Scholars Demo Day 2021 | OpenAI



Transcript

Okay, hi everyone. I'm Tyna, and I was mentored by Josh Achiam. I'm excited to present my Scholars project, where I engineered a framework that disentangles data containing many behaviors from different experts, in order to learn to steer a model towards one mode of behavior or another.

The internet is chock-full of data that many of our machine learning models leverage for training. These data, however, are produced by people, entities, or organizations that have their own utility functions, so they can be thought of as being produced conditional on those utility functions. When our models ingest this large chunk of data wholesale, they tend to assimilate and reproduce the behaviors encoded in those utility functions. As researchers and designers, we may want to retain the ability to steer our trained models towards or away from some modes of behavior. Furthermore, as our models grow in capability and are applied to increasingly complex and diverse settings, we may want to steer their behavior to align with the context or with human preferences.

With that super ambitious motivation, I'm going to motivate the experimental setup. Our proposed solution looks something like this: in an offline RL setting, I took a batch of data and tried to learn a mode-conditional policy from it. In the ideal scenario, the single policy that you see on the left, conditioned on the state and some context vector z_1 through z_n, would correspond exactly to the policy of some expert conditioned only on the state. So, for example, pi of a, conditional on state s and z_n, could correspond exactly to pi of expert n. So what is the process for training this model?

First, we collect samples, which is straightforward. We chose to work with samples rather than expert policies because examples of success are easier to come by in nature, especially if you think back to the motivating example of internet data. These samples are then passed to a VQ-VAE that is responsible for clustering them. Traditionally, the VQ-VAE produces discrete labels, but in this context we pass distances instead, which I'll get to in a moment, to a generator, which here is a Gaussian MLP actor that recovers a probability distribution conditioned on the current state and the proposed cluster information. So let's take a closer look at the architecture.

The first sub-model is a VQ-VAE, inspired by van den Oord et al. (2017), with a small modification. The VQ-VAE, as usual, has the objective of distilling its inputs, here state transitions, well enough that the decoder can reconstruct them. In the middle you see an embedding space, which maps the encoder representations to clusters via a simple argmin function: if a tensor is closest in Euclidean distance to embedding tensor j, then map that sample to cluster j. Note that this means the size of the embedding space determines the maximum total number of clusters you allow the VQ-VAE to create; if you set K to n, then you can get up to n clusters. I want to note here that the labels, this clustering information, are the most essential component. Where a traditional VQ-VAE produces discrete labels, we instead simply take those distance vectors, as they contain the original signals. These distance vectors become the instructions that we send to the generator to tell it how we want it to behave given the state.
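To make that step concrete, here is a minimal sketch of the nearest-embedding lookup, assuming a PyTorch-style codebook; the names and shapes are illustrative rather than taken from my implementation, and the straight-through gradient trick used to train the encoder is omitted:

```python
import torch

def quantize(z_e, codebook):
    """Map encoder outputs to clusters and return the distance vectors.

    z_e      : (batch, d) encoder representations of state transitions
    codebook : (K, d) embedding tensors; K is the maximum number of clusters
    """
    # Squared Euclidean distance from each encoding to every embedding.
    dists = torch.cdist(z_e, codebook) ** 2      # (batch, K)

    # A traditional VQ-VAE keeps only the discrete labels ...
    labels = dists.argmin(dim=1)                 # (batch,)
    z_q = codebook[labels]                       # quantized vectors for the decoder

    # ... but here the distance vectors themselves are what we pass on
    # to the generator as the clustering signal.
    return z_q, labels, dists
```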

Now that we've received some clustering information from the VQ-VAE, in this case distances instead of labels, we concatenate it to the state and pass it through a Gaussian MLP. From the resulting distribution we evaluate the probability of the actions we see in the data, the ones taken by the quote-unquote expert, and training increases the probability of the true action under that conditional normal distribution.
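Here is a small sketch of that generator and its loss, assuming a diagonal-Gaussian policy head; the layer sizes and names are placeholders, not my actual hyperparameters:

```python
import torch
import torch.nn as nn

class GaussianMLPActor(nn.Module):
    """Policy conditioned on the state and the clustering information."""

    def __init__(self, state_dim, context_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + context_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state, context):
        # Concatenate the state with the distance vector from the VQ-VAE.
        h = self.body(torch.cat([state, context], dim=-1))
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

def policy_loss(actor, state, context, expert_action):
    # Negative log-likelihood of the expert's action under the conditional
    # Gaussian, so minimizing it pushes the true action's probability up.
    dist = actor(state, context)
    return -dist.log_prob(expert_action).sum(dim=-1).mean()
```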

This is what happens at inference time: the model observes a state, concatenates a context vector provided by the supervisor, which in this context is me, then produces a conditional policy and draws an action. The context vector is the type of vector we call a "not-hot" encoding, because the distances are minimized, so the entries at the correct label index vanish to zero as the model trains.
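At a sketch level, picking a mode at inference time might look like the following; the fill value for the non-selected entries is an assumption, since all that matters is that the chosen index holds the smallest distance:

```python
import torch

def not_hot(mode_index, num_modes, fill=1.0):
    """'Not-hot' context vector: zero at the chosen mode, larger elsewhere."""
    ctx = torch.full((num_modes,), fill)
    ctx[mode_index] = 0.0
    return ctx

# Observe a state, pick a mode, draw an action.
# `actor` is a trained Gaussian MLP actor as sketched above.
# state  = torch.as_tensor(observation, dtype=torch.float32)
# action = actor(state, not_hot(mode_index=2, num_modes=4)).sample()
```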

so let’s look at what the training

objective looks like

in math and then in words so here’s the

training objective

um if you don’t understand what it means

don’t worry about it for those who

recognize it

the numerator is the vq vae objective

and the denominator

is just the conditional policy loss

so in words what was all that math about

so the first term of the numer the

numerator is the reconstruction loss to

encourage the encoder and decoder to

communicate

effectively through good latent

representations

then we have l2 loss from the encoder

output

that incentivizes the encoder to make

representations that are close to the

embeddings

and we also have an l2 laws from the

decoder output that incentivizes the

embeddings to stay close to encoder

representation because

the embedding space is dimensionless and

we wouldn’t want it to grow indefinitely

lastly in the denominator as you saw we

had a policy loss which makes

actions more comparatively more likely

when the clustering algorithm is more

confident about the context

and this dependency is

uh reflected in the fact that these two

models the

mlp and the vqve train uh concurrently
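Since the slide with the equation isn't reproduced in this transcript, here is a sketch of the objective in the form just described, assuming the standard VQ-VAE terms from van den Oord et al. (2017), where sg is the stop-gradient operator, and writing the conditional policy loss as a negative log-likelihood; the commitment weight beta and the exact form of the policy term are assumptions, since only the components are named here:

$$
\mathcal{L} \;=\; \frac{\mathcal{L}_{\text{VQ-VAE}}}{\mathcal{L}_{\pi}},
\qquad
\mathcal{L}_{\text{VQ-VAE}} \;=\; \lVert x - \hat{x} \rVert_2^2
\;+\; \lVert \operatorname{sg}[z_e(x)] - e \rVert_2^2
\;+\; \beta\,\lVert z_e(x) - \operatorname{sg}[e] \rVert_2^2,
\qquad
\mathcal{L}_{\pi} \;=\; -\log \pi_\theta(a \mid s, d)
$$

where x is a state transition, z_e(x) the encoder output, e the nearest embedding, and d the vector of distances from z_e(x) to each embedding.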

so let’s take a look at some demos

so this is the setting we chose for

experimentation it’s

a continuous control environment that i

chose specifically because uh

the expert behavior can be explicitly

designed and you can test the quality of

context-specific imitation

so in this setting you have an agent in

red

that lives on this lane where he can

navigate to any space using an action

vector that selects forward and

rotational velocities respectively

there’s a goal

that you see there in green that resets

to

a random location elsewhere on the plane

whenever it is reached and then there’s

some hazards

in purple and a vase uh the aquamarine

object you can ignore

Here we have two custom-designed experts. I call them experts because they're just very good at one particular thing. On the left you have a goal-seeking agent that only cares about pursuing the goal, and you can see that it has gotten very good at it: on that panel, the blue dots are the goals at the various locations where they respawned after being reached, and the red dots are the hazards. On the right you have a forward-moving agent. I want to emphasize here that in all of the plots you'll see, the setting is exactly the same; it's seeded so that the placement of the goals, the hazards, and so on is exactly the same. The only thing we change is the context vector that we feed to the trained model.

Okay, so how well does our clustering work? I'd like to point out two factors that seem to significantly influence how expert behavior gets encoded in the VQ-VAE. One is K, the number of allowed partitions that you give to the VQ-VAE, and one is the time-step size used when calculating state transitions. Here, between transitions, there is a time step of one, and we can see that as you increase the number of partitions allowed by the VQ-VAE, it is better able to map different expert behaviors to different latent spaces: with just two or three allowed partitions it really struggles, and when you give it four or five it maps different behaviors to different spots. Here we have a time difference of five in the state transitions, and when you take larger steps to calculate the transition, the model seems to learn to cluster much better and faster. You can maybe think of this as a case where one agent walks forward all the time and the other agent walks forward for maybe three steps and then turns: in the one-step scenario, the model will have difficulty separating the two agents while they're walking straight for some of the time. A future direction therefore might be to model some longer-term dependencies, such as with LSTMs or with attention.
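As a rough sketch of how those transition samples might be built with a larger time step (the pairing of observations delta steps apart, and the choice to concatenate them, are my reading of the setup rather than the exact preprocessing):

```python
import numpy as np

def make_transitions(states, delta=5):
    """Pair each observation with the one `delta` steps later.

    states : (T, obs_dim) array of observations from one trajectory
    delta  : time-step size; delta=1 pairs consecutive observations,
             larger values separate the two experts' behaviors faster.
    """
    states = np.asarray(states)
    return np.concatenate([states[:-delta], states[delta:]], axis=-1)
```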

Okay, so to refresh your memory, here's what these experts look like in this setting. It looks like I'm running out of time. These are the different modes we gave this particular model with four partitions (we set K equal to four). This is after one epoch of training: we map the different modes on the very same environment setup, and they don't look very different. This is what it looks like after a thousand epochs, and a little bit of differentiation happens: one of the modes wants to sweep in wide circles, and mode three learns to go in the forward direction. And this is after 5,000 epochs, about 2.5 million samples, with the very same settings. Now we see that mode two picks up the goal-seeking behavior pretty clearly, and mode three continues to learn, though not perfectly, the forward-moving behavior. Mode one and mode four don't seem to be mapping to anything in particular, or maybe to a complex combination of behaviors.

There are so many threads for future directions of this research. As I mentioned earlier, modeling longer-term path dependencies, for example with LSTMs or attention, is one I'm really interested in working on. I'd also love to be able to extract generalizable properties of these types of models as you move from one context to another. It would also be interesting to look at quantitative performance guarantees: for example, if you have some experts that are bound to hit certain metrics, like energy consumption or hazard destruction, I want to make sure that the modes at least approximate the experts they correspond to. I'd then love to learn how and when to mode-switch, so you have a model that can learn to mode-switch on its own from human feedback or other environmental feedback, and I'd love to experiment with different modalities and with interpretability.

Thank you so much. I would like to thank my mentor Josh for his unwavering support throughout this process, OpenAI for this amazing opportunity, all the staff for their continuing availability for all my questions, and my fellow scholars for their brilliant suggestions and feedback. Now I'd be happy to take some questions.

okay let’s see here

oops

sorry

okay all right

Okay, the question is: right now, a limitation of the VQ-VAE is that when the number of modes is small compared to the number of n-step behaviors the model could exhibit, and when some n-step behaviors could be shared between modes, disentanglement could be quite hard; what might be interesting to do moving forward for better behavior disentanglement? Yeah, I think it would be interesting to look at longer path rollouts, and maybe, on top of the discretizing modes of the VQ-VAE, to add some continuous information to what the generator sees, so that you have a longer context from which to deduce behavior.

I think that is the only question. I would be happy to answer more questions offline, and people can see my blog post over the weekend as well, with more details. Thank you all, and I'll pass it back to Christina.