Transcript
Hello everyone, I'm excited to be presenting my Scholars project, which focuses on social learning in independent multi-agent reinforcement learning.

My interest in social learning came from reflecting on how it is that I, as a human, have the capacities that I do. If I had happened to be born in the woods, away from all other humans, I would probably have quickly starved to death. But thanks to my ability to tap into cultural knowledge, I have the potential to do all sorts of awesome things, like participate in a space program, or lie in bed all day and browse Twitter. If an alien just appeared on Earth and only ever saw an example of a human in isolation, I think it would be very surprised by the broad variety of behaviors that groups of humans are able to exhibit, or that individual humans can exhibit when they tap into that cultural knowledge. Because of the centrality of social learning to human intelligence, I think it's important to understand the circumstances in which social learning can take place. As a way into this, there's a cool anecdote from experimental sociology.
A group of monkeys were put in a room along with a ladder, and some bananas were suspended from the ceiling such that they could be reached by a monkey who climbed the ladder but were otherwise inaccessible. Any time a monkey climbed the ladder to get the bananas, the experimenters would spray the rest of the monkeys with cold water. So the other monkeys learned to beat up any monkey that tried to climb the ladder, in order to prevent themselves from getting sprayed. This behavior persisted even after the monkeys stopped being sprayed with water. Even more interestingly, when new monkeys were introduced into the group after the water spraying had ceased, the new monkeys would of course try to get to the bananas, and the other monkeys would beat them up. So the new monkeys learned not to go for the bananas, but they also learned to punish other monkeys that tried. This became a cultural phenomenon among the monkeys. As it happens, this experiment is apocryphal and never actually took place, but I think it still serves as an interesting template for how we can try to understand social learning.
The question I'm interested in answering is whether independent reinforcement learning agents can learn from each other just by virtue of the fact that they exist in the same environment and can observe one another. I think this is an important question because, as reinforcement learning becomes more capable, it seems likely that there will be many environments in which many reinforcement learning agents interact, for instance autonomous, adaptive agents trading stocks in a market. So it's clearly important to understand the circumstances in which they might learn from one another and exhibit behavior that we might not expect if we were only looking at any one of them in isolation.
I'll break my talk down into two parts: first I'll discuss the tools I used to approach this question, in particular the environments and reinforcement learning algorithms, and then I'll talk about some actual experiments on learning from experts.

I developed an open-source grid world implementation called MarlGrid, which fits the standard OpenAI Gym API. It's easy to extend, so it's easy to put a large number of agents in the environment, and it's very configurable. There are also some registered environments for reproducibility. Given how obscure this domain is, I'm surprised it's already gotten a bit of traction on GitHub. This is an example of the visualizations I've built: these agents are effectively untrained, but it's easy to include a lot of them in the environment and visualize what each of them is doing.
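Concretely, interacting with a MarlGrid environment through the Gym API might look something like the minimal sketch below. The environment ID, the import path, and the per-agent list interface are my assumptions here, not guaranteed details of the package:

import gym
import marlgrid.envs  # assumed import path that registers the environments with Gym

env = gym.make("MarlGrid-2AgentCluttered11x11-v0")  # hypothetical environment ID

obs_list = env.reset()  # assumed: one observation per agent
done = False
while not done:
    # Random per-agent actions; a trained policy would go here.
    # Assumes a shared per-agent action space.
    actions = [env.action_space.sample() for _ in obs_list]
    obs_list, rewards, done, info = env.step(actions)
    env.render()  # draws all agents in the shared grid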
The particular scenario I spent a lot of time working with I call "goal cycle". In this environment there are a number of goal tiles, and agents are rewarded for traversing them in a certain order and penalized any time they break that order. You can experiment with this environment, the one I'm training here, by installing the Python package from GitHub.

This environment is a rough analogue of the room with the monkeys: the reinforcement learning agents that live in it can observe one another and, in principle, interact with one another. There are a couple of interesting things about it. The penalty is configurable, and changing its value changes the difficulty of learning to explore the environment effectively. When the penalty is low, the agents largely ignore the penalty incurred by stepping on the tiles out of order; in the video on the left, the agent is not cycling through the tiles in order, and any time it steps on a tile out of order, its color resets to red. When the penalty is very high, exploration is costly because incurring the penalties is aversive, so the agents learn to step on the first tile, where they get a reward, and then avoid all of the tiles. By controlling the value of this penalty, we can change the difficulty of exploration, and, in the context of social learning, the difficulty of learning the effective strategy directly from the environment as opposed to learning it by observing other agents.
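To make the mechanics concrete, here is a rough sketch of the goal-cycle reward rule as described above. The constants, the function name, and the reset-on-mistake behavior are illustrative assumptions rather than MarlGrid's actual internals:

GOAL_REWARD = 1.0
GOAL_PENALTY = 0.5  # configurable; higher values make exploration more costly

def goal_tile_reward(tile_index, expected_index, num_goals):
    """Reward for stepping on goal tile `tile_index` when `expected_index` is next."""
    if tile_index == expected_index:
        # Correct tile: positive reward, advance around the cycle.
        return GOAL_REWARD, (expected_index + 1) % num_goals
    # Out-of-order tile: penalty; progress is assumed to reset to the start.
    return -GOAL_PENALTY, 0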
The other big tool was the reinforcement learning algorithms themselves. I started by implementing DQN, which is pretty standard for this sort of simple environment, but I needed to add memory for the agents to be able to learn strategies that unfold over more than one time step. This didn't work super well, and I spent a lot of effort trying to improve it; notably, I implemented prioritized experience replay, which is tricky in combination with the LSTM, and it still didn't work very well. So I implemented PPO and immediately saw a pretty big improvement. I further found that carrying over some tricks from the R2D2 implementation, notably refreshing the hidden states collected in the environment over the course of the update steps, significantly improved the agents' capacity to use their memories to accomplish tasks. These plots show the difference that made for a simple goal cycle environment where the agent is learning to traverse the goals: when this trick is applied, the agents achieve much higher rewards and their training is much more stable.
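As an illustration of what "adding memory" means here, the sketch below shows a generic recurrent policy of the kind described, with an LSTM carrying hidden state across time steps. This is a schematic PyTorch module, not my actual implementation:

import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim: int, hidden_dim: int, num_actions: int):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.lstm = nn.LSTMCell(hidden_dim, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)  # action logits
        self.value_head = nn.Linear(hidden_dim, 1)              # state value for PPO

    def step(self, obs, state):
        """One time step: consume an observation, update memory, emit outputs."""
        h, c = self.lstm(torch.relu(self.encoder(obs)), state)
        return self.policy_head(h), self.value_head(h), (h, c)

    def initial_state(self, batch_size: int):
        zeros = torch.zeros(batch_size, self.lstm.hidden_size)
        return (zeros, zeros.clone())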
To recap, a large part of the effort of the project went into developing the reinforcement learning algorithms and environments that let agents effectively learn tasks amenable to the kind of experiments I'll discuss next.

Revisiting the original question, I'm interested in knowing when independent agents can learn from experts to accomplish tasks, or can acquire skills from experts. What this might look like: we have a bunch of experts with a high level of skill, and a novice who is introduced to the environment, initially unskilled, but who is able to reach expertise just by observing the experts. We'd also want it to be the case that if the novice were alone, it would be unable to learn and its skill would remain low.
There is a paper that addresses a question like this, called "Observational Learning by Reinforcement Learning" by Borsa et al. from DeepMind. In their paper, the experts are hard-coded and the novices use RL to accomplish a task in a simple grid world. The diagram on the top shows a bird's-eye view of the map: the expert, in blue, travels optimally to a goal which, in each episode, is placed randomly at one of sixteen positions, and the novice needs to learn to get to the goal as well. Here's a video of that. They found that the experts help the novices learn more quickly, but that even in the presence of the experts, the novices don't ultimately do any better than they would if they were learning alone.
I started by trying to replicate the first finding in a simple cluttered grid world, which is like the goal cycle grid worlds I showed earlier but with only one goal, and I found, quite convincingly, that the presence of experts didn't help the novice agents learn to accomplish their task any more quickly. The takeaway is that it's hard to learn from social cues in these environments, but that doesn't prove it's impossible. In order to look in a more targeted way for the circumstances in which this might happen, my effort shifted to different environments, in particular the goal cycle environment.
The goal of my experiments has been to construct a scenario where, in contrast to the Borsa results, novices and experts are the same sort of agent, both trained by reinforcement learning; where solitary novices struggle to learn; and where the presence of experts helps. Ideally we'd want the novices to be able to themselves become experts, so that we can see they have truly mastered the skill. And as a bonus, whereas in the Borsa case there's not all that much information the novices can get from the experts, because the goal is in one of only sixteen places and the novices could just memorize the potential places, we want something that looks a bit more like a skill. We get this in the goal cycle environment, because the process of spawning in a new environment and trying out the different possible cycles until identifying the correct one is a closer analogue to a skill than just picking up a cue about which quadrant the goal is in, or something like that.
I found that when the goal cycles are masked from the view of novice agents, the novices do in fact learn to follow experts, which is consistent with the results from Borsa. Both of these videos exhibit this behavior; the novices are shown at the bottom of the columns on the right, and in both cases the novices are doing a really robust kind of following behavior. Here, one of the experts happens to have spawned in a trap. In these cases, because the novices are just following the experts, they end up converging to slightly lower performance than the experts, as you can see in this graph.
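A sketch of the masking manipulation: hide the goal-tile information in the novice's observations, so the cycle can only be learned by watching experts. The channel layout and function here are hypothetical, not the project's actual code:

import numpy as np

GOAL_CHANNEL = 2  # hypothetical channel index encoding goal tiles in the obs grid

def mask_goals(novice_obs: np.ndarray) -> np.ndarray:
    """Zero out the goal-tile channel of a novice's grid observation."""
    masked = novice_obs.copy()
    masked[..., GOAL_CHANNEL] = 0
    return masked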
So far, the conclusion I've drawn is that it's very hard to learn from experts, and when it's possible to acquire a skill directly from the environment, agents will likely do that. The next steps for this project, which I'll continue working on, focus on creating environments where the information available from the experts is a more valuable cue for obtaining high reward than the information available directly from the environment. I plan to increase the number of goals, experiment with different penalty values, and so on. Also, in the example I showed, the following behavior, while it does help the agent accrue more reward, isn't quite the same skill the experts are showing. Going back to the monkey analogy, we want the novice agents to be doing the same thing the experts are doing, exhibiting the same skillful behavior.
A better way to measure that would be to look at the performance of the agents when they're moved to a new environment without experts, as in the sketch below.
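A minimal sketch of that transfer evaluation, with placeholder interfaces for the policy and environment:

def evaluate_alone(novice_policy, make_solo_env, num_episodes=100):
    """Average return of the trained novice acting alone (placeholder interfaces)."""
    returns = []
    for _ in range(num_episodes):
        env = make_solo_env()  # same task distribution, but no experts present
        obs = env.reset()
        done, total = False, 0.0
        while not done:
            obs, reward, done, _ = env.step(novice_policy.act(obs))
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)  # high return suggests the skill was truly acquired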
Another approach is to add mechanisms that encourage agents to learn socially. It's not clear, for instance, to what degree humans are social learners because they're biologically predisposed to be, as opposed to because of the environments they're in, though by comparing to animals we might expect the former. We can similarly introduce priors like these into agents, and then characterize the emergence of social behavior by varying or turning down that prior.
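As a purely hypothetical illustration, such a prior could be as simple as a small intrinsic bonus for keeping another agent in view, with a coefficient we can later anneal; nothing like this appears in the experiments above:

def shaped_reward(env_reward: float, sees_other_agent: bool, social_coef: float) -> float:
    """Add a small intrinsic bonus when another agent is in the field of view."""
    bonus = social_coef if sees_other_agent else 0.0
    return env_reward + bonus

# Sweeping social_coef from high to zero would let us characterize how much of the
# social behavior emerges from the prior versus from the environment itself.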
I'd like to thank my mentor, Natasha, who has been incredibly supportive and incredibly helpful, both in helping me make the best use of learning resources and in helping me engage with the broader research community. I'd like to thank the program coordinators, Mariah and Kristina, for helping the program run smoothly even in light of the pandemic. I'd like to thank my fellow scholars for a lot of incredibly informative discussions, and for just generally being extremely supportive. Special shout-outs to Weights & Biases for helping me keep track of my experiments, and to Alethea Power for lending me a graphics card that I've been using for some of these experiments.
So, I have time for some questions.

The first question: could a novice become more expert than an expert, such that other experts learn from it?
That's a great question. In the experiments I've been doing, the experts continue to learn alongside the novices. Here, for instance, in this plot, the experts are still learning, but because in this environment they happen to be close to optimal, we don't see much change as they continue to adapt. But in principle, yes, this could happen. I think another interesting direction for understanding social behavior in independent multi-agent reinforcement learning is to carefully study the impact of just learning in a group, which is kind of similar to that.
Cool. Another question: could you elaborate on hidden-state refreshing in your agent? When do you refresh the hidden state, and how does it differ from the R2D2 approach?

So, I trained a lot of the agents with PPO. PPO agents alternate between collecting experience in an environment and updating based on that experience. During the update phase, the agents sample their experience and perform a bunch of small updates based on that batch of experience before discarding it at the end of the update. In typical PPO-LSTM implementations, the agents save their hidden states as they interact with the environment, which is like remembering what was in their mind alongside the experiences, and then sample those as they perform each of these little updates. But the nature of the experience they collected depends on the values in the hidden state, and the values in the hidden state depend on their parameters. So as the parameters update, the stored hidden states become less and less representative: a divergence grows between the collected data and the current values of the parameters. I found that it wasn't too costly to recompute the hidden states, and I have some tweaks to my LSTM implementation that facilitate this, so in the end I refresh them between each gradient step. The R2D2 approach differs in a few ways, and the reason for those differences, I think, is mainly that R2D2 is off-policy, so the volume of experience that can go into each update is much larger. Because of this, they need to employ some tricks to make sure the hidden states don't get too stale without refreshing them between each gradient step, which would be very costly. For PPO and on-policy reinforcement learning, it didn't matter too much.
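To make that concrete, here is a minimal sketch of the refreshing step as just described, reusing the hypothetical recurrent-policy interface from the earlier sketch; the buffer layout is also an assumption:

import torch

def refresh_hidden_states(policy, buffer):
    """Recompute stored per-step hidden states using the current parameters."""
    with torch.no_grad():
        for episode in buffer.episodes:              # hypothetical buffer layout
            state = policy.initial_state(batch_size=1)
            refreshed = []
            for obs in episode.observations:         # replay the episode forward
                refreshed.append(state)              # the state fed in at this step
                _, _, state = policy.step(obs, state)  # advance with current weights
            episode.hidden_states = refreshed        # overwrite the stale states

# During the PPO update phase, this would be called between gradient steps:
#   for _ in range(num_gradient_steps):
#       loss = ppo_loss(policy, buffer.sample_minibatch())
#       loss.backward(); optimizer.step(); optimizer.zero_grad()
#       refresh_hidden_states(policy, buffer)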
Another question: why do you think proximal policy optimization worked so well? That's a good question. I've been thinking a bunch about this, and I think a lot of it in practice comes from the fact that my implementation of PPO is based on the Spinning Up implementation (Spinning Up also deserves a shout-out), so it inherited a lot of tweaks that help the agent learn stably and perform well. I hesitate to say that PPO is simply better than the alternatives; that was certainly my experience, but I think I inherited a lot of improvements from the implementation I based it on. And then the hidden-state refreshing, I think, is interesting: it helped immensely with robustness. I think the reason is that it prevents the policy from making big changes over the course of each update, and this helps ensure that the policy is consistent with the data it's learning from. I guess I would be interested in some clarification on that question, but...