Transcript
Hi everyone, my name is [inaudible]. I'm currently a PhD student in the Carnegie Mellon Machine Learning Department, and I'm interning on the robotics team, working with Peter and Wojciech. Today I'm going to talk about a robust vision-based state estimator.
The goal of the robotics team is to build a very strong robotics foundation that we can use on the path toward real AGI. We are focusing on developing general learning-based algorithms that work across a diverse set of robotic tasks. To make sure the algorithms we develop are general enough, we need to pick a pretty hard task that shows the algorithms really work. The task we pick here is to have a Shadow Hand rotate an object to a desired pose.
A Shadow Hand is a robot that looks very much like a human hand: it has five fingers and many joints, and it can reach very delicate poses. This task is particularly difficult because asking a human engineer to hard-code this high-dimensional controller is nearly impossible.

But before the robot can start moving its fingers to solve this task, the first question we want to answer is: how does the robot know where the object is and what pose it is in? In other words, how does the robot know the position and orientation of the object? In robotics and reinforcement learning, we call these two the state.
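To make that concrete, the state here is just a 3D position plus an orientation. Below is a minimal sketch of one common way to represent it; the quaternion convention and the dataclass layout are my assumptions, not something specified in the talk.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ObjectState:
    position: np.ndarray     # (3,) x, y, z in meters, in the robot's world frame
    orientation: np.ndarray  # (4,) unit quaternion (w, x, y, z)

# Example: a cube 10 cm in front of the hand, rotated 90 degrees about z.
state = ObjectState(
    position=np.array([0.10, 0.0, 0.0]),
    orientation=np.array([np.cos(np.pi / 4), 0.0, 0.0, np.sin(np.pi / 4)]),
)
```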
One way to obtain the state is a 3D tracking system. To use a 3D tracker, you attach some markers to the object you are interested in, and you put the robot in a big cage where you surround it with a bunch of 3D sensors. These sensors read off signals from the markers and tell you the 3D location of each marker. As you can see, this method can give a pretty accurate object state, but it is pretty expensive, and it kind of restricts the robot to only seeing objects inside this cage. Besides, if I have a new object today, I first need to ask somebody to attach markers to it for me, which is impractical.

So here we want a more biologically inspired solution, which is to just use cameras. We set up three cameras on the cage, looking at our robot and our object, and from there we try to infer the state of the object. There are several benefits to building a vision-based state estimator: it is more flexible, it is cheaper, it is easy to set up, and it is really what we want to deploy on our robot.
To summarize, our environment looks like this: you have the cage, you have the robot in the center, and you set up cameras looking at your robot and your object. From these cameras you capture three images that view the scene from different points of view. The question we want to answer is: how do we go from these three images to the state we want to know?
Now let me introduce our method. Our model is a deep neural network. Neural networks are super powerful because they generalize to a lot of tasks, achieving state-of-the-art results, and most importantly they don't require the engineer to hard-code features. Before going into how to train this neural network, let's go into the details of what it actually looks like.
Our network takes the three images as input. We pass each image through several convolutional layers and fully connected layers, and then we aggregate the outputs of these convolutional towers to predict the final object state.
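Here is a rough sketch of that shape. The layer sizes, the lack of weight sharing between towers, and the use of PyTorch are all my assumptions; the talk only specifies three per-camera convolutional towers whose outputs are aggregated into one state prediction.

```python
import torch
import torch.nn as nn

class StateEstimator(nn.Module):
    """Three per-camera conv towers whose features are merged into one state prediction."""

    def __init__(self, state_dim: int = 7):  # 3 for position + 4 for a quaternion
        super().__init__()
        self.towers = nn.ModuleList([self._make_tower() for _ in range(3)])
        self.head = nn.Sequential(
            nn.Linear(3 * 128, 128), nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    @staticmethod
    def _make_tower() -> nn.Sequential:
        # Convolutional layers followed by fully connected layers, as in the talk.
        return nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128), nn.ReLU(),
        )

    def forward(self, images: list[torch.Tensor]) -> torch.Tensor:
        # images: three (B, 3, H, W) tensors, one per camera.
        features = [tower(img) for tower, img in zip(self.towers, images)]
        return self.head(torch.cat(features, dim=1))
```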
To train such a network, aside from the network itself, you need two additional things. One is a large amount of data containing many examples of what the inputs and the outputs look like; here, our inputs are the three images and our output is the state of the object. So then the question is: how do we get this large dataset? It is impossible to collect in the real world: although we can capture the images, we have no way of knowing where the object actually is.

To solve this problem, we use a simulator. In the simulator we build a very similar robot and a very similar object, and we also set up three very similar cameras shooting from the right, the top, and the left. From the simulator we can also read off the ground-truth state of the object. So our training data looks like what is on the right-hand side: three images paired with the ground-truth state of the object. Besides that, because it's in the simulator, we can easily change the texture and color of our robot, change the lighting, and change the background. If we do that, we get a very diverse dataset that can cover maybe all the situations the agent may encounter in the real world.
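As a sketch of the idea, the randomization loop might look roughly like this; the `sim` handle and its methods are hypothetical stand-ins, and the ranges are illustrative rather than the values the team actually used.

```python
import random

def randomize_scene(sim):
    """Resample visual properties so no single appearance is 'the' real world.

    `sim` is a hypothetical simulator handle; the randomization ranges below
    are illustrative only.
    """
    for body in sim.bodies():                    # robot links, object, cage
        body.set_rgba([random.random() for _ in range(3)] + [1.0])
        body.set_texture(random.choice(sim.texture_library))
    for light in sim.lights():
        light.set_intensity(random.uniform(0.3, 1.5))
        light.set_position([random.uniform(-2, 2) for _ in range(3)])
    sim.set_background(random.choice(sim.background_images))
```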
By having this large amount of data, together with an objective function that minimizes the distance between the prediction and the ground truth, we are able to train the network in the simulator, and it works well in the real world. It can predict a very accurate state of the object in the real world, and we are able to really solve the task with this state estimator.
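To make the objective concrete, here is a minimal sketch of that loss, assuming the position-plus-quaternion state layout from earlier and treating the whole state vector with a single Euclidean distance, as described in the talk.

```python
import torch

def state_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean Euclidean distance between predicted and ground-truth states.

    pred, target: (B, 7) tensors laid out as [x, y, z, qw, qx, qy, qz].
    """
    return torch.norm(pred - target, dim=1).mean()
```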
However, there is still a problem with our current vision system. The problem is that our simulator needs very carefully aligned cameras. "Aligned" means the camera in the simulator needs to have exactly the same position as the one in the real world. The way we get this is that an engineer goes into the simulator and slightly adjusts the camera until the images are super aligned. This requires extra effort, because every time we change the camera position or rotation, we need an engineer to do it again. Another thing is that if, during test time, someone accidentally touches one of the cameras on the cage, the vision system breaks.

Here I want to emphasize how serious this problem is. I show some state-prediction errors on real images using a calibrated environment and an uncalibrated environment. The blue curve is with calibrated cameras, the others are with misaligned cameras, and the performance degrades a lot.

So why is this happening?
Because our neural network is just trying to find some useful pattern and directly regress from that pattern to the output you want it to predict, since your only objective function is this Euclidean distance between the ground truth and your prediction.
If we sit back and think about how a person would figure out this problem: say I have this hand here, and I ask you, what is the position and orientation of this cube in this 3D coordinate system, given only this image captured from some random camera? What would you do? You would probably say: of course, today we care about the object, so I definitely need to focus on the object. And you would probably also say: I want to know where this image was captured from, so you might look at the arm of the robot to figure out where the camera is. Combining your object detection and the position of the camera, you figure out the global state of the object. So to answer this question, a human needs geometry knowledge, and a human needs attention on the right object. Then the question is: how do we tell the robot that these two things are important, and how do we encode them in our solution?
The solution is this: because in the simulator we can get a bunch of 3D information, we can get whatever ground truth we want. So we can extract more information from the simulator and add these two things to the network's training, to tell it what matters. Let me introduce these two things.

The first one is that we want to force the network to learn about geometry. When we make the prediction, we are implicitly trying to infer where the camera is shooting from. So on top of the output of each image tower, I add an auxiliary task: please also predict the camera's position and orientation, to make sure the network really understands where this image was captured from. By adding this auxiliary task, we actually improve the prediction of the object's position: the red line is without the auxiliary task, and the green line is with it. By adding this, we are forcing the network to arrive at a more reasonable solution for this task.
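A sketch of how such an auxiliary head could hang off each per-camera tower follows; the feature size, loss weighting, and pose layout are my assumptions, since the talk only says that each image tower additionally predicts its camera's position and orientation.

```python
import torch.nn as nn

class TowerWithCameraHead(nn.Module):
    """Wraps a per-camera tower with an auxiliary camera-pose prediction."""

    def __init__(self, tower: nn.Module, feat_dim: int = 128):
        super().__init__()
        self.tower = tower
        self.camera_head = nn.Linear(feat_dim, 7)  # camera position + quaternion

    def forward(self, image):
        feat = self.tower(image)
        return feat, self.camera_head(feat)

# During training, the auxiliary losses are simply added to the main objective:
#   loss = state_loss(pred_state, true_state) \
#          + aux_weight * sum(state_loss(cam_pred[i], cam_true[i]) for i in range(3))
```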
The other thing is that we want to cast the right attention on this task. Previously we said: because we are answering what the state of the object is, we definitely need to look at the object instead of focusing on the background, so we circle the object. And if I want to answer where the camera is, I probably need to focus on the arm of the robot. These two regions are what we feel the network should focus on. So here I ask the network to additionally predict bounding boxes for these two objects.
Note that the bounding boxes can be obtained from the simulator; in the simulator we can get any kind of ground truth we want. From these bounding-box locations we extract localized features and concatenate these localized features for our final prediction. With this kind of forced attention, the network actually improves on orientation prediction: the red line is the baseline without any attention, the blue line is with attention on the cube, and the bottom line, the orange one, is with forced attention on both the cube and the arm of the robot.
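A sketch of the forced-attention step: regress boxes for the cube and the arm, pool features inside the predicted boxes, and concatenate those localized features into the final prediction. The use of torchvision's roi_align and all the sizes here are my assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class AttentionHead(nn.Module):
    """Predicts boxes for the cube and the arm, then pools features inside them."""

    def __init__(self, feat_channels: int = 64):
        super().__init__()
        # Two boxes (cube, arm), each as (x1, y1, x2, y2) in feature-map coordinates.
        self.box_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat_channels, 8),
        )

    def forward(self, feat_map: torch.Tensor):
        # feat_map: (B, C, H, W) from a convolutional tower.
        boxes = self.box_head(feat_map).view(-1, 2, 4)   # (B, 2, 4)
        # Pool a small feature patch inside each predicted box.
        local = roi_align(feat_map, list(boxes), output_size=(4, 4))
        local = local.view(feat_map.size(0), -1)         # (B, 2 * C * 16)
        return boxes, local  # boxes are supervised with simulator ground truth
```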
To conclude, here are my takeaways. We are proposing a learning-based vision system that is very robust to camera position, and we hope it also transfers well to the real setting. Previously, in our earlier release, we proposed domain randomization, where we randomized all kinds of textures, lighting, and backgrounds in the simulator. But here I want to emphasize that, in order for our visual model to really understand 3D, to understand geometry, we can extract more supervision from the simulator. First, we can enforce geometry learning by adding auxiliary tasks like predicting camera pose. Second, we can enforce the right attention by asking the vision system to predict bounding boxes and then use the attended features. Thank you.
We have about one minute for Q&A.

Q: [inaudible]

A: So actually, I have tried randomizing the camera position like crazy, where the camera can move all around in this space, but that decreased the prediction performance a lot, even in the simulator. That means if you just have this naive model that feeds an image through some convolutions, it is not going to figure out the answer. We also tried randomizing the camera position in our previous release, and it kind of hurt the final performance on real images, so we removed it. The thing is, our network structure is not good enough, so even if we randomize the camera a lot, it cannot learn something useful; it needs something else to help it. Thanks.
One more question? Does anyone...?

Q: Hi, how do you think your approach would work when the object is not in the hand? With a camera you have the extra complexity of depth sensing, and differences in the image are not necessarily reflective of real-world differences in distance. So how do you think this would work when the object is further away?

A: If the object comes from a distribution that is not the distribution we trained on, then it will definitely diverge in the real world; we cannot guarantee anything. Or do you mean if we have an extra depth sensor? That would help.

Q: My question is more: do you think the methodology would also apply when the object is further away, say, if you want the robot to be able to point to the object? The state, I guess, would just be the position of the object. If all the cameras are here, enclosing the robot, with the robot hand in the cage, and the object is over there, sure, you can have all the camera positioning figured out, but you have this extra distance to account for, right, given the depth-perception capabilities of the cameras.

A: I think that extra complexity can be solved if you have an additional camera looking from the side, so you can make sure that dimension is correct. But the thing is, if something is super far away, then even for a human, say a car that is super far away, across the street from you, you cannot precisely tell the distance between you and the car.