Transcript
The first graphic I wanted to share with you is of a reinforcement learning agent playing the Atari Breakout video game. That's an example of a case where standard RL can learn very well in a certain environment. But the problem I'll be focusing on is specifically the problem of delayed rewards, and this is very relevant to real life, because as you interact with your environment you don't typically get points every time you move; there are actions that are separated from their effects by very long timescales.

Kathy, would you like me to help you move your slides? Yeah, I can't seem to change my screen, my keyboard's not responding. I'll show them. Oh, okay, thanks Francis. I think I might have it; apologies for the technical difficulties. Okay, can you guys see my slides? Yes? Okay.

So the plan is going to be in three parts. First, I'll describe why standard RL struggles with tasks with long-delayed rewards. Next, I'll describe the temporal reward transport algorithm, or TRT, that I've been working on to address this problem. And finally, I'll share some results from experiments using TRT.
So the model in reinforcement learning is that you have an agent interacting with the world; as it interacts, it transitions from state to state, and it can pick up rewards along its trajectory. Here I have an equation for something called the discounted return, which is just the sum of all the rewards the agent picks up along its trajectory. But there's an extra piece in this return, the discount factor gamma, and it gets at the crux of why standard RL algorithms struggle with tasks with delayed rewards. The gamma introduces a timescale: it's basically a heuristic that says you care about rewards now more than rewards later in the future, so you're discounting your future rewards.
In this plot I have an example: if you're standing at time zero and look forward a hundred time steps, a reward a hundred time steps in the future is discounted to about 37 percent of its original value. That's totally fine in an environment where your immediate actions really affect only the most immediate rewards, but in cases with long delays it's not going to work as well.
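To make that concrete, here is a minimal sketch of the discounted return, assuming gamma = 0.99, which matches the roughly 37 percent figure at a hundred steps; the function name is mine, not from the talk.

```python
def discounted_return(rewards, gamma=0.99):
    """The discounted return G = sum_k gamma^k * r_k along one trajectory."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

gamma = 0.99
print(gamma**100)       # ~0.366: a reward 100 steps away keeps only ~37% of its value
print(gamma**100 * 20)  # a 20-point distal reward contributes only ~7.3 to G
```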
So let's take a look at that. Here's an example: we have a little agent walking through the environment, and at some point it can choose to pick up a key or not. It doesn't get a reward for picking up that key. The agent continues interacting with the environment until the very end, when it reaches a green goal; if it did pick up the key back in that first state, it's rewarded with a bonus of 20 points. The problem with standard RL, though, is that it reinforces actions based on the rewards acquired after taking each action, and if we look at the rewards that would weight this particular state-action pair, where the agent is next to the key, we'd see that the future rewards are highly attenuated. So there's really low signal, and learning is very slow as a consequence. What can we do about that?
That brings me to the next part of my talk, the TRT algorithm. This algorithm is based on work by Hung et al. from DeepMind on optimizing agent behavior over long timescales by transporting value. The idea is that if you've identified the significant state-action pairs that should receive credit for some long-term reward, you can splice those distant rewards onto those state-action pairs, amplifying the signal that reinforces those actions. That's what we see here: in the original situation on the slide, the agent receives zero immediate points for picking up the key, but we splice in the future rewards, in this case a distal reward of 20 points, and suddenly we have a lot more signal with which to increase the probability of taking that action in that state. Which is what we want: we want the agent to learn to pick up the key.
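As a rough illustration, here is a minimal sketch of the splicing step, assuming the significant timesteps have already been identified; the function and its alpha knob are hypothetical names of mine, not the talk's actual code.

```python
def splice_rewards(rewards, significant_steps, distal_reward, alpha=1.0):
    """Temporal reward transport, sketched: add a portion of a distal reward
    directly onto the significant state-action pairs, so the learning signal
    there is no longer attenuated by discounting."""
    spliced = list(rewards)
    for t in significant_steps:
        spliced[t] += alpha * distal_reward  # splice the distant reward in
    return spliced

# The key pickup at t = 3 originally earns 0; after splicing it earns 20.
rewards = [0.0] * 100 + [20.0]
print(splice_rewards(rewards, significant_steps=[3], distal_reward=20.0)[3])  # 20.0
```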
So that brings us to the next question: how do we know which state-action pairs are significant and should receive these spliced rewards in TRT? That is the problem of credit assignment in reinforcement learning, and the way we do it, and the way Hung et al. did it, is with an attention mechanism. The idea is that you do a full rollout of an episode, with the agent interacting with the environment; at the end of that rollout you pass the entire sequence of states and actions to a model, in my case a binary classifier, and you look at which state-action pairs were paid the most attention by the other frames. Here I have a heat map of the attention scores; by the way, the axes denote the frames of the trajectory, on both the x and y axes. You can see two really bright stripes, and they correspond to highly attended state-action pairs. If we do a sanity check and bring up what those particular observations were, we see that they match what we would expect: in this case we have an agent, the little red triangle, right next to a key, so it might be moving toward it or trying to pick it up. That's good confirmation that we're attending to the important states and actions.
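The model here is a binary classifier with attention over the whole episode, but as a simplified sketch of how attention scores can flag significant frames, plain scaled dot-product self-attention over frame embeddings looks like this; all of the names are mine, and this is not the classifier's actual architecture.

```python
import numpy as np

def attention_map(frame_embeddings):
    """Scaled dot-product self-attention over one episode's frames.
    frame_embeddings: (T, d) array, one embedding per state-action pair.
    Returns a (T, T) matrix like the heat map: entry (i, j) is how much
    frame i attends to frame j."""
    T, d = frame_embeddings.shape
    logits = frame_embeddings @ frame_embeddings.T / np.sqrt(d)
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)

def most_attended(scores, top_k=2):
    """The bright stripes in the heat map are columns receiving a lot of
    total attention; return the indices of the top_k such frames."""
    return np.argsort(scores.sum(axis=0))[-top_k:]

# Toy usage: 10 random frame embeddings, flag the 2 most-attended frames.
scores = attention_map(np.random.randn(10, 8))
print(most_attended(scores))
```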
The next step is to test whether this TRT algorithm works, so I created an environment specifically constructed to be challenging to learn in if you don't do credit assignment over long timescales. This environment has three phases. In the first phase, the agent encounters an empty grid with just a single key; the agent can choose to pick up that key or not, but it doesn't receive any immediate reward if it picks it up. The second phase is a distractor phase, which we fill with gifts; when the agent opens a gift, it gets an immediate reward. The final phase, which is the focus of our evaluation, is the phase in which the agent can earn a distal reward: when the agent navigates to the green goal, it gets 20 points if it learned to pick up the key in phase one, and just 5 points if it never learned to pick up the key. That's going to be the focus of the rest of the experimental results I'm going to show you: we'll look at whether the agent learns to pick up the key and correspondingly gets the 20 points in phase three.
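In code, the reward rules of one episode might be summarized like this; a hypothetical sketch of the rules above, not the actual environment implementation.

```python
def episode_score(picked_up_key, gifts_opened, gift_reward=5):
    """Total points earned across the three phases of one episode."""
    phase1 = 0                           # the key itself gives no immediate reward
    phase2 = gifts_opened * gift_reward  # distractor gifts pay out immediately
    phase3 = 20 if picked_up_key else 5  # distal reward at the green goal
    return phase1 + phase2 + phase3

print(episode_score(picked_up_key=True, gifts_opened=4))   # 40
print(episode_score(picked_up_key=False, gifts_opened=4))  # 25
```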
I have three separate experimental slides to show you. They all involve varying the parameters of the distractor phase, essentially making it more and more challenging for the agent to learn. The three parameters I vary are the time delay, which is the time the agent is forced to spend in the distractor phase; the size of the distractor gift rewards; and the variance of the distractor rewards.

These plots show the total reward earned by the agent in phase three, which tells us whether it picked up the key or not, and as you move from left to right the task becomes increasingly difficult. Each plot corresponds to a certain delay, measured in units of tau gamma, the discount-factor timescale, increasing from left to right. You can see that initially, when the time delay is not that long, the agent does learn to pick up the key, though not quite as well as with the TRT algorithm on top of advantage actor-critic; the baseline here is A2C, advantage actor-critic. But by the time you get to the rightmost plot, A2C has basically plateaued at five points, which, if you recall, corresponds to only moving to the goal and never learning to pick up the key, whereas A2C with TRT shows consistent progress toward learning to pick up the key.

The next slide shows experimental results for varying the distractor reward size. Again it gets harder from left to right as we increase the size of the distractor rewards, and again we see the same pattern: A2C with the TRT algorithm does better than A2C alone.

The final slide is an experiment showing the returns in phase three when we vary the variance of the distractor rewards. In this case we have four gifts, all with a mean reward of five, but for each gift we sample from a uniform distribution around five, and we increase the range of that uniform distribution in order to increase the variance. So they all have the same mean reward, but there's greater variance, and again you can see that A2C plus TRT does better than A2C alone.
So to summarize, we've seen that adding temporal reward transport on top of a standard reinforcement learning algorithm seems to show some benefit for long-term credit assignment. This work builds directly on ideas from the Hung et al. 2019 paper, and the two core concepts to take away from it are, first, using some form of temporal value or reward transport to splice credit onto significant state-action pairs, and second, using attention to identify those important state-action pairs. Our contribution here is a completely different architecture implementing these two core concepts. It's much simpler than the original paper's implementation, and the environment is also much simpler, but I think there's definite merit in showing that the concept carries beyond the original implementation in the paper. The implementation is also very modular: I completely separated the attention part of the algorithm into a standalone classifier that identifies the significant state-action pairs, so you can imagine that if you want to try adding TRT to some other model, it's very easy to do because of the modular implementation.
As for the future, there's always more to do. This is just a heuristic, and while it's a useful heuristic, it would be interesting to move beyond it. Also, I've only shown you results for a very simple gridworld environment, so it would be interesting to see how well the algorithm holds up in more complex situations.

With that said, I had a lot of fun working on this project, and I want to move on to the Q&A stage, but first I want to give my thanks. I'd like to thank my mentor, Jerry at OpenAI, for being with me through this whole process, and OpenAI itself for this wonderful opportunity and for all the different people I've talked to; I've picked up so much from informal, casual conversations. Thank you to the program organizers, Christina and the team, who have been really wonderful and so supportive of the scholars. I also have my lovely scholars cohort to thank; they were also very supportive, and there was a lot of knowledge sharing as we all ramped up on deep learning at the same time. Finally, I'd like to thank Square, my employer, for giving me the chance to take some time off to do this program. If you're interested in more project details, my write-up is available on my blog. Okay, now I'm going to take a look at the questions.
So the question is: can you explain why the distractor phase makes the task more difficult? In other words, in your opinion, why does the agent not learn the more general behavior of simply interacting with all objects, picking up the key and opening the gifts?

I think this gets at how standard RL algorithms work. With policy gradients, if you have a policy, which is how you choose an action in a particular state, you can amplify or reduce the likelihood of taking that action based on the rewards that follow it, and because of discounting you won't see rewards in the far future. In this case, the state in phase one where you're next to the key isn't going to receive that amplification, because the rewards in the distractor phase are tied to being in the states next to those distractor gifts. The agent very quickly learns to open the gifts, because it sees an immediate reward from the gifts, so the weighting of that reward is very high, and it reinforces the idea that we want to take the toggle action to open the gifts. But that doesn't simply carry over: just because the agent has learned to take that particular action in those states, the way the algorithm is set up, it doesn't translate into learning to do the same thing when it's next to the key.
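As a back-of-the-envelope illustration of that answer, with my own assumed numbers: gamma = 0.99, the goal roughly a hundred steps after the key, and a marginal key benefit of 20 - 5 = 15 points.

```python
gamma = 0.99

# Policy-gradient weight on the key-pickup action: the extra 15 points it
# eventually earns arrive ~100 steps later, so the credit is attenuated
# (and buried in the noisy distractor returns along the way).
key_signal = gamma**100 * (20 - 5)  # ~5.5

# Weight on a gift-opening action: its 5-point reward arrives one step
# later, so the credit is immediate and low-variance.
gift_signal = gamma**1 * 5          # ~4.95

print(key_signal, gift_signal)
```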
Okay, the next question: "I am curious to know if using more advanced deep RL algorithms like PPO would weigh more than TRT's influence in the results." I'm not 100 percent sure about the wording, but it sounds like it's asking what would happen if I had tried this on PPO instead of A2C, and that's a straightforward test we can do because of the modular implementation. PPO is more sample efficient than A2C: it's able to do several smaller updates, compared to the single update A2C does, before discarding a batch of experience. Given that, I would expect it to have a better learning curve. I'm not sure whether the interaction with TRT would be any different, but I would expect PPO on its own to look a little better than the baseline I showed here; I just haven't tested it out yet.
What was the most challenging part of your project?

Oh, there was a lot. I think at some point, once I decided on this particular path, I had a bunch of ideas for how to get it working, and every time I tried something new and committed it to GitHub, I thought maybe this would be it. So the challenging part was seeing the deadlines looming and starting to realize that my fixes weren't necessarily becoming the last fix. But in the end things worked out. I realized there were some artifacts being introduced by the way this algorithm is set up; it's very sensitive to the full context of the episode, and I had to think about how to handle that, because it would have required quite a bit of re-engineering of my code, with very little time, in order to do parallel training, which was really important since I needed to run this on many workers. I had an idea to rejigger something, well, I'm running out of time, but basically it was just pushing through, and it kind of worked out in the end, so I'm really glad of that.

Okay, with that, thanks.