Social learning in independent multi-agent reinforcement learning | Kamal N’dousse | OpenAI Scholars Demo Day 2020 | OpenAI



Transcript

Hello everyone, I'm excited to be presenting my Scholars project, which focuses on social learning in independent multi-agent reinforcement learning.

My interest in social learning came through reflecting on how it is that I, as a human, have the capacities that I do. If I had happened to be born in the woods, away from all other humans, I would probably have quickly starved to death. But thanks to my ability to tap into cultural knowledge, I have the potential to do all sorts of awesome things, like participate in a space program, or lie in bed all day and browse Twitter. If an alien appeared on Earth and only ever saw a human in isolation, I think it would be very surprised by the broad variety of behaviors that groups of humans are able to exhibit, or that individual humans can exhibit when they tap into that cultural knowledge. Because of the centrality of social learning to human intelligence, I think it's important to understand the circumstances in which social learning can take place.

To experiment with this, it helps to start from a well-known anecdote from experimental sociology. A group of monkeys was put in a room along with a ladder, and some bananas were suspended from the ceiling such that they could be reached by a monkey who climbed the ladder but were otherwise inaccessible.

quickly do you won’t be cheering your

slides so that was the image you’re

talking about yes

you

you

apologies is that working better that’s

Cool. So, there's a group of monkeys in a room, and they can only reach the bananas using a ladder. Any time a monkey climbed the ladder to access the bananas, the experimenters would spray the rest of the monkeys with cold water. The other monkeys learned that they should beat up any monkey that tried to climb the ladder, in order to prevent themselves from getting sprayed. This behavior persisted even after the monkeys stopped being sprayed with water. Even more interestingly, when new monkeys were introduced into the group after the water spraying had ceased, the new monkeys would of course try to get to the bananas, and the other monkeys would beat them up. So the new monkeys would learn not to go after the bananas, but they would also learn to punish other monkeys that tried to get the bananas. This became a kind of cultural phenomenon among the monkeys. As it happens, this experiment is apocryphal and did not actually take place, but I think it still serves as an interesting template for how we can try to understand social learning.

The question I'm interested in answering is whether independent reinforcement learning agents can learn from each other just by virtue of the fact that they exist in the same environment and can observe one another. I think this is an important question because, as reinforcement learning becomes more capable, it seems likely that there will be many environments in which many reinforcement learning agents interact: for instance, autonomous and adaptive robots, or agents trading stocks in a market. So it's clearly important to understand the circumstances in which they might learn from one another and exhibit behavior that we might not expect if we were only looking at one of them in isolation.

I'll break my talk down into two parts. First, I'll discuss the tools that I used to approach this question, in particular the environments and the reinforcement learning algorithms; then I'll talk about some actual experiments on learning from experts.

I developed an open-source grid-world implementation called MarlGrid, which fits the standard OpenAI Gym API. It's easy to extend, so it's easy to put a large number of agents in the environment, and it's very configurable. There are also some registered environments for reproducibility. Given how obscure this domain is, I'm surprised that it's already gotten a little bit of traction on GitHub. This is an example of the visualizations that I've built: these agents are effectively untrained, but it's easy to include a lot of them in the environment and visualize what each of them is doing.
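
As a rough sketch of what using one of the registered environments might look like (the environment ID and the per-agent handling of observations and actions here are my assumptions based on a standard Gym-style multi-agent interface, not taken verbatim from the MarlGrid docs):

```python
import gym
import marlgrid.envs  # assumed import that registers the MarlGrid environments with Gym

# Illustrative environment ID; see the MarlGrid README for the actual registered names.
env = gym.make("MarlGrid-3AgentCluttered11x11-v0")

obs = env.reset()  # assumed to return one observation per agent
done = False
while not done:
    # Each agent acts independently; here every agent just samples a random action.
    actions = [env.action_space.sample() for _ in range(len(obs))]
    obs, rewards, done, info = env.step(actions)
    env.render()
```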

The particular scenario that I spent a lot of time working with I call "goal cycle." In this environment there are a number of goal tiles, and agents are rewarded for traversing them in a certain order and penalized any time they mess up that order. You can experiment with this particular environment, the one shown here, by installing the Python package from GitHub.
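
As a minimal sketch of the reward structure just described (the function name, the reset-on-mistake behavior, and the default values are illustrative assumptions, not the actual MarlGrid code):

```python
def goal_cycle_reward(agent_tile, goal_cycle, next_idx, goal_reward=1.0, penalty=-1.0):
    """Reward an agent for visiting goal tiles in a fixed cyclic order.

    agent_tile: the tile the agent just stepped onto
    goal_cycle: goal tile coordinates listed in the correct order
    next_idx:   index of the goal the agent is expected to visit next
    Returns (reward, updated next_idx).
    """
    if agent_tile not in goal_cycle:
        return 0.0, next_idx  # ordinary tiles are neutral
    if agent_tile == goal_cycle[next_idx]:
        return goal_reward, (next_idx + 1) % len(goal_cycle)  # correct tile: advance the cycle
    return penalty, 0  # out of order: configurable penalty; assume progress resets
```

Making the penalty small or large is the knob that controls how risky independent exploration is, which is the point discussed next.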

This environment is kind of an analogue to the room with the monkeys: the reinforcement learning agents that exist in it can observe one another and, in principle, interact with one another. There are a couple of interesting things about this environment. The penalty is configurable, and changing its value changes the difficulty of learning to explore the environment effectively. When the penalty is low, the agents essentially ignore the penalty incurred by stepping on the tiles out of order; in the video on the left, the agent is not cycling through the tiles in order, and any time it steps on a tile out of order, the tile's color resets to red. When the penalty is very high, exploration is costly because incurring the penalties is aversive, and the agents learn to step on the first tile, where they get a reward, and then avoid all of the tiles afterward. So by controlling the value of this penalty, we can change the difficulty of exploration, and in the context of social learning, we change the difficulty of learning the effective strategy directly from the environment as opposed to learning it by observing other agents.

The other big tool was the set of reinforcement learning algorithms that I used. I started by implementing DQN, which is pretty standard for this sort of simple environment, but I needed to add memory for the agents to be able to learn strategies that unfold over more than one time step. This didn't work super well, and I spent a lot of effort trying to improve it; notably, I implemented prioritized experience replay, which is kind of tricky with the addition of the LSTM, and it still didn't work very well. So I implemented PPO and immediately found a pretty big improvement. Further, I found that carrying over some of the tricks from the R2D2 implementation, notably refreshing the hidden states collected during environment interaction over the course of update steps, significantly improved the agents' capacity to use their memories to accomplish tasks.

These plots show the difference that this made for a simple goal-cycle environment where the agent is learning to traverse the goals: when the trick is applied, the agents are able to achieve much higher rewards, and their training is much more stable. So, to recap, a large part of the effort in this project went into developing the reinforcement learning algorithms and environments that allow agents to effectively learn tasks amenable to the kind of experiments that I will discuss next.

Revisiting the original question: I'm interested in knowing when independent agents can learn from experts to accomplish tasks, or can acquire skills from experts. What this might look like is that we have a bunch of experts who have a high level of skill, and a novice who is introduced to the environment. Initially the novice is unskillful, but it is able to reach the point of expertise just by observing the experts. We'd also want it to be the case that if the novice were alone, it would be unable to learn and its skill would remain low.

There is a paper that addresses a question like this, called "Observational Learning by Reinforcement Learning" by Borsa et al. from DeepMind. In their paper the experts are hard-coded and the novices use RL to accomplish a task in a simple grid world. The diagram on the top shows a bird's-eye view of the map: the expert, in blue, optimally travels to a goal, which at each episode is placed randomly at one of sixteen positions, and the novice needs to learn to get to the goal as well. Here's an image from the video. They found that the experts help the novices learn more quickly, but even in the presence of the experts, the novices don't ultimately do any better than they would if they were learning alone.

I started by trying to replicate the first finding in a simple cluttered grid world, which is like the goal-cycle grid worlds I showed earlier but with only one goal, and I found, very convincingly, that the presence of experts didn't help the novice agents learn to accomplish their task any more quickly. The takeaway here is that it's hard to learn from social cues in these environments, but that doesn't prove it's impossible, and in order to look in a more targeted way for the circumstances in which it might happen, my effort shifted to different environments, in particular the goal-cycle environment.

The goal of my experiments has been to construct a scenario where, in contrast to the Borsa results, novices and experts are the same sort of agent, so they're both trained by reinforcement learning; where solitary novices struggle to learn; and where the presence of experts helps. Ideally, we'd want the novices to be able to themselves become experts, so that we can see that they have mastered the skill. As a bonus, whereas in the Borsa case there's not all that much information that the novices can get from the experts, because the goal is in one of sixteen places and the novices could just memorize the potential places, we want something that looks a bit more like a skill. We get this in the goal-cycle environment, because the process of spawning in a new environment and trying out the different possible cycles until identifying the correct one is a closer analogue to a skill than just cueing on which quadrant the goal is in, or something like that.

I found that when the goal cycles are masked from the view of novice agents, novices do in fact learn to follow experts, and this is consistent with the results from Borsa et al.
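
One way such masking could be implemented, purely as an illustration (the color encoding and array layout are assumptions about how goal tiles appear in an agent's image observation, not the actual MarlGrid mechanism):

```python
import numpy as np

def mask_goal_tiles(obs, goal_color=(0, 255, 0), fill_color=(0, 0, 0)):
    """Blank out goal-tile pixels in a single agent's H x W x 3 image observation,
    so a novice cannot read the cycle directly from its own view."""
    obs = obs.copy()
    is_goal = np.all(obs == np.array(goal_color), axis=-1)  # pixels matching the (assumed) goal color
    obs[is_goal] = np.array(fill_color)
    return obs
```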

Both of these videos exhibit this behavior; the novices are shown at the bottom of the columns on the right, and in both cases the novices are doing a really robust kind of following behavior. Here, one of the experts happens to have spawned in a trap, basically, and in these cases, because the novices are just following the experts, they end up converging to slightly lower performance than the experts, as you can see in this graph.

So far, the conclusion I've drawn is that it's very hard to learn from experts, and when it's possible to acquire a skill directly from the environment, it is likely that agents will do that. The next steps for this project, which I'll continue working on, focus on trying to create environments where the information available from the experts is a more valuable cue for how to obtain a high reward than the information available directly from the environment; so I plan to increase the number of goals, experiment with different penalty values, and so on. Also, in the example that I showed, the following behavior, while it does help the agent accrue more rewards, isn't quite the same skill that the experts are showing. Going back to the monkey analogy, we want the novice agents to be doing the same thing that the experts are doing, exhibiting the same skillful behavior.

A better way to measure that would be to look at the performance of the agents when they're moved to a new environment without experts. Another approach is to add mechanisms to encourage agents to learn socially. It's not clear, for instance, to what degree humans are social learners because they're biologically predisposed to be, as opposed to because of the environments they're in; by comparing to animals, we might expect the former. We can similarly introduce priors like these into agents, and then characterize the emergence of the social behavior by varying or turning down that prior.
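
Purely as a hypothetical illustration of what such a prior could look like (none of this is from the experiments described above), one could shape each agent's reward with a social bonus, for example a term for keeping other agents in view, and then anneal its weight to turn the prior down:

```python
def shaped_reward(env_reward, social_bonus, beta):
    """Combine the environment reward with a hypothetical social-learning bonus."""
    return env_reward + beta * social_bonus

def beta_schedule(step, total_steps, beta_start=0.5):
    """Linearly anneal the prior's weight toward zero over training."""
    return beta_start * max(0.0, 1.0 - step / total_steps)
```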

I'd like to thank my mentor, Natasha, who has been incredibly supportive and incredibly helpful, both in helping me make the best use of learning resources and in helping me engage with the broader research community. I'd like to thank the program coordinators, Mariah and Kristina, for helping the program run smoothly even in light of the pandemic. I'd like to thank my fellow Scholars for a lot of incredibly informative discussions and for generally being extremely supportive. Special shout-outs to Weights & Biases for helping me keep track of my experiments, and to Alethea Power for lending me a graphics card that I've been using for some of these experiments. So, I have time for some questions.

The first question is: can a novice become more expert than an expert, such that other experts learn from it? That's a great question. In the experiments I've been doing, the experts continue to learn alongside the novices. Here, for instance, in this plot, the experts are still learning, but because in this environment they happen to be close to optimal, we don't see much change as they continue to adapt. But in principle, yes, this could happen. I think another interesting direction for understanding social behavior in independent multi-agent reinforcement learning is to carefully study the impact of just learning in a group, which is kind of similar to that.

Cool. Another question: could you elaborate on hidden-state refreshing in your agent? When do you refresh the hidden state, and how does it differ from the R2D2 approach? So, I trained a lot of the agents with PPO. With PPO, agents alternate between collecting experience in an environment and updating based on that experience. During the update phase, the agents sample their experience and perform a bunch of small updates based on that batch of experience before discarding it at the end of the update. In typical PPO + LSTM implementations, the agents save their hidden states as they interact with the environment, which is like remembering what was in their mind alongside the experiences, and then they sample those saved states as they do each of these small updates. But the nature of the experience they collected depends on the values in the hidden state, and the values in the hidden state depend on their parameters. So as the parameters update, the saved hidden states, and hence the behavior the agents are learning from, become less and less representative: there is a growing divergence between the data in their experience and the current values of the parameters. I found that it wasn't too costly to recompute the hidden states, and I have some tweaks to my LSTM implementation that facilitate this, so in the end I refresh them basically between each iteration, each gradient step.
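
As a rough PyTorch-style sketch of that refreshing step (the tensor shapes and the idea of storing per-step states in a rollout buffer are my assumptions about the setup, not the exact implementation):

```python
import torch

def refresh_hidden_states(lstm, encoded_obs, h0, c0):
    """Recompute per-timestep LSTM states for a stored rollout with the current parameters.

    encoded_obs: (T, batch, features) tensor of already-encoded observations from the buffer
    h0, c0:      (num_layers, batch, hidden) states recorded at the start of the rollout
    Returns per-step hidden outputs that replace the stale states saved during collection.
    """
    with torch.no_grad():
        outputs, _ = lstm(encoded_obs, (h0, c0))  # outputs: (T, batch, hidden)
    return outputs
```

Called between gradient steps, this keeps the hidden states that initialize each minibatch segment consistent with the parameters currently being optimized.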

The R2D2 approach differs in a few ways. The reason for those differences, I think, is mainly that R2D2 is off-policy, so the volume of experience that can go into each update is much larger. Because of this, they need to employ some tricks to make sure that the hidden states don't get too stale without refreshing them between each iteration, since that would be very costly. But for PPO and on-policy reinforcement learning, it didn't matter too much.

Another question: why do you think that proximal policy optimization worked so well? That's a good question. I have been thinking a bunch about this, and I think a lot of it in practice comes from the fact that my implementation of PPO is based on the Spinning Up implementation, which also deserves a shout-out, so it inherited a lot of tweaks that helped the agent learn stably and perform well. So I hesitate to say that PPO is better in general; that's certainly my experience, but I think I inherited a lot of improvements from the implementation that I based mine on. And then the hidden-state refreshing, I think, is interesting: it helped immensely with robustness, and I think the reason is that it prevents the policy from making big changes over the course of each update, which helps ensure that the policy stays consistent with the data that it's learning from. I guess I would be interested in some clarification on that question, but...