Transcript
Hi folks, I'm Alex Ray. I'm on the robotics team here, and this is going to be a different sort of talk. Instead of one of these really great intern projects, which are unfortunately limited on people and limited on time, this is an overview of a recent large team result: about twelve months of work by about twelve people. There's a lot to get through and I'm only going to scratch the surface. If you want to learn more, we have a really well-produced three-minute video on YouTube and on our website, which is a really cool overview; we have a blog post that's a nice human-readable format; and there's a longer research paper, so there are lots of different levels of detail. For now I'm going to give you an overview of what we did and a peek inside it. A quick outline of how I'm going to break down my tiny amount of time: the task, the sort of problem we tried to solve; the research process itself, what solving it looked like; the systems we built, which were actually very, very simple; and the results we got from them.
Our task: we have a five-fingered dexterous robot hand. It's underactuated, with 20 degrees of control and 24 degrees of freedom; for reinforcement learning folks who deal with continuous control, that is an awful lot. We have objects in the hand that we would like it to manipulate, and for us "manipulate" means achieving arbitrary rotations: if you imagine holding a small object in your hand, can you point it in an arbitrary direction without dropping it? Our primary goal, our North Star, was to manipulate the object in the robot hand, specifically through a sequence of 50 independent, randomly drawn rotations. We also had secondary goals that we wanted to hit but that weren't strictly necessary. We wanted to solve it from vision, which means you don't need a specialized object: you can just drop the object in the hand, and as long as your vision model can see it, you'll be able to manipulate it. Relatedly, we wanted to manipulate diverse objects, not just cubes with the OpenAI letters on them. And we wanted to train using a physics simulator, without any real data. So again, the North Star is the primary task; the rest are reach goals.
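To make the primary task concrete, here is a minimal sketch of drawing a sequence of 50 independent target rotations as uniformly random unit quaternions (Shoemake's method); the sequence length is from the talk, everything else is illustrative rather than the procedure from the paper.

```python
import numpy as np

def random_quaternion(rng):
    """Sample a uniformly random unit quaternion (Shoemake's method)."""
    u1, u2, u3 = rng.uniform(size=3)
    return np.array([
        np.sqrt(1 - u1) * np.sin(2 * np.pi * u2),
        np.sqrt(1 - u1) * np.cos(2 * np.pi * u2),
        np.sqrt(u1) * np.sin(2 * np.pi * u3),
        np.sqrt(u1) * np.cos(2 * np.pi * u3),
    ])

rng = np.random.default_rng(0)
# One episode's goals: 50 independent, randomly drawn target orientations.
goal_sequence = [random_quaternion(rng) for _ in range(50)]
```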
here’s an example of our physical setup
there’s this giant cage with a robot
hand in it and on the other side is our
simulation rendered with our simulation
renderer the robot is a giant bag of
unmodeled effects it has backlash it has
transmission problems it has creep and
stretch in the tendons and the simulator
has none of that here’s our secondary
goal of the the cameras is just sort of
showing where the cameras are in the
cage our secondary goal of diverse
objects this is manipulating a
octahedral prison which is kind of cool
and then trading purely in simulated
data so this is all the stuff we have
during training is we both have a
renderer that simulates our vision data
and a physic simulator that simulates
our physics data and sort of the process
The process itself is a lot of dealing with robot hands; it's not all training models, but it starts with training models. Over and over for twelve months we ran an iterative cycle that got faster and faster as we got better at it: it used to take more than a month, and now we're able to do it in a few days. We train a totally new model with reinforcement learning that controls the policy of the robot, and a totally new vision model that localizes the object inside the hand. We try running them on the real robot, we observe that it fails and how it fails, and then we try to improve it and repeat.
With all of these systems, reinforcement learning, robotics, deep neural networks, physics simulation, each one a source of complexity, we largely tried to focus on building the simplest possible solution. One of the things we did was start very ambitiously and then break the problem down to a much, much simpler task. We initially tried to achieve six degrees of freedom on the object: not only which direction it's pointed, but where it is, lifted up off the palm, and things like that. We simplified that to just rotation; we simplified that to just major-axis-aligned rotation, just get the x axis to point up; then we just tried spinning it around z; and when we had trouble doing that, we just tried reaching the fingertips to arbitrary positions in space. Eventually we were able to climb back up this ramp, but this was a big part of unlocking a bunch of research progress quickly earlier in the project.
Another thing we learned along the way is that you have to try lots of things at once. In the blog post and the paper we describe two of our vision tracking systems, but we don't really describe all the ones we tried that didn't work. Here are just some, and I'm probably missing a few: we tried OptiTrack-style retro-reflective infrared tracking dots, which are common in the motion picture industry; we tried depth cameras like a RealSense or Kinect; we tried magnetic field tracking like Polhemus, the sort of early virtual reality controller setup; we tried active illumination targets from PhaseSpace, those red dots that you see on the fingertips, and those actually did work; we tried fiducial and barcode tracking like ArUco; and finally we tried the RGB vision cameras. So while the paper sort of says "here's the simplest thing that worked," there were quite a lot of things that didn't work ahead of that. The final research ingredient in how we were able to solve this task was lots and lots of domain randomization.
We found that instead of trying to accurately model the robot, which, with its big bag of unmodeled defects, would have been effectively impossible and definitely would have run much, much slower than real time (and we want a fast simulator), we added lots of domain randomizations. Instead of accurately modeling the world, you just have to model the noise of the world, and sometimes the noise of the world is more efficient to sample from. For example, we don't know the exact friction of the hand; actually, we didn't even really know what the units of friction were or what sane values looked like. On top of that, there are things on the hand that might approximate themselves as friction, like ridges due to machining or little screw holes that objects could get caught in, and there are different types of materials. So instead of trying to accurately model these, we just took all the parts of the hand, which visually you see as different colors, and assigned them all very different frictions during training. Somewhere along the project we contracted a professional roboticist to come in and fix the hand whenever it broke, because we were breaking it so often, and when we told them what values of friction we were using, they were surprised it worked at all. But again, these simple systems are able to learn very robust policies. We also tried adding a glove to the robot, trying to solve it from the other side: instead of solving it with the AI, solve it with the physical world. It turns out gloves didn't end up working, which was sort of surprising to us, but the domain randomization did.
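As a concrete illustration of per-part friction randomization, here is a minimal sketch in the style of the mujoco-py API, resampling a wide sliding-friction range for each hand part at episode reset; the part names and ranges are invented for illustration, not the values we actually used.

```python
import numpy as np

# Hypothetical part names; a real model would name one or more geoms per link.
HAND_GEOMS = ["palm", "thumb_distal", "index_distal", "middle_distal"]

def randomize_friction(model, rng):
    """Give every hand part an independently sampled sliding friction.

    The range is deliberately wide: the point of domain randomization is to
    cover the true (unknown) value, not to estimate it precisely.
    """
    for name in HAND_GEOMS:
        gid = model.geom_name2id(name)        # mujoco-py-style name lookup
        model.geom_friction[gid][0] = rng.uniform(0.3, 3.0)
```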
here’s a quick overview of the system we
built a bigger graphic sort of from from
the one you saw from fish earlier we
have on the top left we have our policy
which is a very very simple recurrent
model and then in the middle we have our
vision model we trained both of them
separately they’re not actually trained
together we tried that once and it
turned out to not help so it’s
computationally cheaper to train them
both we did that and then when they’re
rolled out on the real world we just
sort of like slap them both together we
take camera images from the camera we
pass them to the vision model it
generates observations for the position
of the object we add that to the sort of
observations to the robot give it to the
policy act and repeat for the neural
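In pseudocode, that deployment loop looks roughly like this; every name here is a hypothetical stand-in for the real components, not an actual API.

```python
def run_episode(cameras, vision_model, policy, robot, goal):
    """Minimal sketch of the deployed control loop (all names illustrative)."""
    state = policy.initial_state()                   # recurrent (LSTM) state
    while not goal.reached():
        images = [cam.capture() for cam in cameras]
        object_pose = vision_model.predict(images)   # position + rotation
        obs = robot.proprioception() + [object_pose, goal.target()]
        action, state = policy.act(obs, state)
        robot.apply(action)
```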
For the neural network nerds out there, here is a rough diagram of the vision model. It's very simple; we were sort of surprised this worked. Like Fish described, it's three convolutional towers that all share parameters; you slap their outputs together and then you regress the position and rotation from them. Fish's work did a bunch to improve this result, but it turns out we were able to just use this and get it to work on the real robot.
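Here's a minimal PyTorch-flavored sketch of that idea: three weight-shared convolutional towers whose features are concatenated and regressed to a pose. The layer sizes are invented for illustration and don't match the paper.

```python
import torch
import torch.nn as nn

class PoseFromCameras(nn.Module):
    """Three camera views -> one shared conv tower -> pose regression."""

    def __init__(self):
        super().__init__()
        # A single tower, reused for every camera view (shared parameters).
        self.tower = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Output: 3 position coordinates + 4 quaternion components.
        self.head = nn.Linear(3 * 64, 7)

    def forward(self, views):               # views: list of 3 image batches
        feats = [self.tower(v) for v in views]
        return self.head(torch.cat(feats, dim=1))
```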
The policy architecture is also very simple, a normal actor-critic setup: we have noisy observations and a goal that are given to the actor. The actor is the interesting one, the one on the left, because that's the one that actually runs on the real robot: it's a fully connected layer and an LSTM, and then the output actions. That's all it took to achieve something the field of robotics hadn't been able to achieve before. The value network gets more observations, without the noise, so it has basically more to go on, but it's mostly used to calculate advantages during training. It's not actually used on the real robot, so we don't care that it gets extra information, and it doesn't have to model the domain randomization noise as much.
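A minimal sketch of that asymmetric actor-critic shape, again with invented sizes: the actor sees noisy observations plus the goal, while the critic also sees privileged, noise-free simulator state that never exists on the real robot.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Fully connected layer -> LSTM -> actions; this runs on the real robot."""

    def __init__(self, obs_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.fc = nn.Linear(obs_dim + goal_dim, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, act_dim)

    def forward(self, obs, goal, state=None):    # obs: (batch, time, obs_dim)
        x = torch.relu(self.fc(torch.cat([obs, goal], dim=-1)))
        x, state = self.lstm(x, state)
        return self.out(x), state

class Critic(nn.Module):
    """Sees extra privileged state; only computes values during training."""

    def __init__(self, obs_dim, goal_dim, priv_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim + priv_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, goal, priv):
        return self.net(torch.cat([obs, goal, priv], dim=-1))
```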
The training architecture is an interesting part of the paper. Most of what's involved in it is in support of one thing: being able to train a simulated robotics policy on more than 6,000 CPU cores and eight GPUs. This turned out to be the simplest thing that worked, and not only the simplest thing that worked for us: we actually inherited it from a different team at OpenAI. It's a system we call Rapid, and it's the same system that powers the Dota bots, so the two have a lot in common. In our paper we compare both soft actor-critic and PPO, but the PPO is basically the same PPO that's playing very, very well against really good players at Dota 2. For our robotics-specific system we have our robotics-specific pieces, and the Dota-specific system has its Dota-specific pieces. Alright, results.
What we found out: I guess the most important result is that yes, we were able to do this task. We were able to do simple object manipulation: we can learn a policy and a vision model purely from simulated data, and they transfer to the real robot hand. Then there are our sub-results, the other things we talked about. As shown earlier, we're able to act from sparse observations: the actor, again the policy network, that thing on the left, is the one that runs on the real robot, and that little x with note 4 is interesting. As noted in the paper, we meant to include that observation but didn't, due to a software bug, and it still worked, which was kind of surprising. We talk about some of the surprising findings in the blog post; there are a lot of things that are counterintuitive to what traditional roboticists would expect that we were able to figure out. I'm going to go through these real quick.
We're able to track objects from cameras with very low positional and rotational error, lower than we got with real images. Part of that is that gathering real data is really, really hard and unstable: if you bump something in your setup, or change how a curtain is hanging, your real data gets old fast, but the simulator keeps on going. We're able to manipulate different objects: we manipulated this octagonal prism, and we tried a couple of other objects; it has trouble manipulating large round objects, and we're still trying to understand that. We're able to train in a simple simulator, and even with all of our randomizations it still trains. We're able to show that the randomizations we added improved performance in the real world. I'm going to skip this one because I'm out of time: having an LSTM is better than having a ConvNet, which is better than just being fully connected, and running with more GPUs gives you better performance. Yeah.
[Applause]
[Audience question]

Oh man, there are lots and lots. Yeah, they're all enumerated in the paper; the short answer is: as many things as we could. For the visuals we actually used Unity, which is a video game renderer, instead of the default renderer, because video game renderers, in pursuit of being more photorealistic, give you many more dials: you can adjust the metallicness, or glossiness, or the color of the reflectance. We randomized everything we could.
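For a sense of what those appearance "dials" look like in code, here's a tiny sketch; the field names mirror typical game-engine material parameters, but this is not an actual Unity API, just an illustration.

```python
import random

def randomize_material(material, rng=random):
    """Resample appearance parameters for one material each training episode.

    `material` is assumed to be a plain dict of renderer parameters; the
    names echo game-engine material dials and are purely illustrative.
    """
    material["metallic"] = rng.uniform(0.0, 1.0)
    material["glossiness"] = rng.uniform(0.0, 1.0)
    material["reflectance_rgb"] = [rng.uniform(0.0, 1.0) for _ in range(3)]
```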
On the physics side, we randomized as much as we could, but that was a more manual process. It turns out physics simulators mostly want to be accurate; they don't want to be inaccurate, and a bunch of the effects were expensive to model. So we tried to simulate backlash, and we explain a little of this in the paper: we don't get it exactly right, because modeling backlash is a really, really hard problem, but we can sort of randomly move motors in the opposite direction, and that's close enough sometimes.
And the PhaseSpace dots, the dots on the fingertips: if you curl the fingers all the way in, the dots can't be seen by the tracking system and they go away. It's hard to model exactly which states those dots disappear in, but we can just have them disappear 25% of the time, and that roughly approximates it.
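A sketch of that marker-dropout idea: rather than modeling the occlusion geometry, just hide each fingertip marker at random (NaN standing in for "not visible"; the representation is an assumption).

```python
import numpy as np

def dropout_markers(marker_positions, rng, p_hidden=0.25):
    """Hide each marker 25% of the time, approximating fingertip occlusion."""
    out = np.array(marker_positions, dtype=float)   # shape: (n_markers, 3)
    hidden = rng.random(len(out)) < p_hidden
    out[hidden] = np.nan                            # marker not visible
    return out
```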
So yeah, much of it comes down to this: the things we can figure out that are easy to randomize, we just randomize by default, and for the effects we noticed on the robot and have hypotheses about, we manually add each one. Does that sort of answer your question?
[Audience question]

Some randomizations help, some randomizations hurt, and some randomizations roughly break even. In basically every case we were able to solve the task in simulation, and this is something that's different from normal academic reinforcement learning in mature simulated physics worlds, where solving it in simulation is what counts. Basically everything we did, we were able to solve in simulation; the only thing that counted for us is whether it worked on the real robot, and we were very, very limited by how many runs we could do. For every trial we would have to run our baseline policy, a policy with known performance, a bunch of times, just to make sure the robot wasn't broken and was behaving the correct way, and then we'd have to run our experimental policy a bunch of times. So there are a bunch of domain randomizations we're not sure improved things, because we added them all at once and they helped, so we kept them all.