Transcript
Hey everyone, I'm Jonathan Ward, and I've been mentored by John Schulman. Over the past six months I've explored the possibility of large-scale reward modeling. What this means in practice is: how do we learn what people want, and then build models that are better able to do that?
This question really starts with what models should do. There are a couple of domains we can think about. One is the formal domain, in which specifying the task is very clear and simple; this would include game playing, both board games and video games. A lot of the recent results in machine learning, around beating Atari or various other video games, or chess, or Go, have centered on these domains, where it's easy to provide clear feedback to the model about what to do next. But a lot of life is actually much more informal, where specifying what correct behavior looks like is much harder to do, and this is the area I'm going to be focusing on today.
There are a couple of ways to proceed, and I'll contrast two in particular. One is the idea of formalizing what's informal. This means trying to write a function that somehow captures the nuances of the problem at hand; it would include things like ROUGE or BLEU, if you're familiar with those terms in machine learning, functions that essentially try to measure how good a summary or a translation is. But there's another approach, which is what I'll focus on, that aims to understand what's good by simply asking people to compare two things or to rate something. This is the setting of learning human preferences.
The tricky part here is actually getting these preferences, getting feedback. There are a couple of ways to proceed, and a lot of excellent work has been done at OpenAI, and at DeepMind as well, on this setting of understanding how to use feedback and how to train systems to incorporate that feedback and improve their performance. This prior work has focused on interactive feedback: feedback where you hire contractors, and the researchers work with those contractors to make sure there's a common understanding. Past work has demonstrated that you can build accurate models of human preferences using this type of feedback.
This is, however, expensive. One potential way forward is to use feedback that's directly available on the internet. This is potentially much less expensive: you can gather more of this data, and you can gather it across various tasks. That's what I'm going to focus on. So the operative question is essentially: can we train an accurate model of human preferences from feedback that's gathered on the internet?
Just to reflect on that a little more: with interactive feedback you get some benefits. You can make sure that the contractors, or whoever is providing the feedback, have a sense of preferences that closely matches whatever gold standard you hold for the true preferences. With internet feedback, you kind of get what's already out there: if there are a lot of ratings of one thing being better than another, better stories, better answers to questions, then that model of what exists and what's preferred is what you're capable of learning from.
With this in mind, I really wanted to focus on more structured, task-oriented feedback. A lot of feedback on the internet is generic, scattered across various tasks: a like on Twitter or YouTube isn't really responding to performance on a specific task, just saying that the comment or the video was good. But in task-oriented domains you can actually get a clear answer about whether a certain explanation, or a certain answer to a question, was good, so there's a very clear sense of input and output.
In particular, I'll be focusing on Reddit. Reddit is the seventh most popular site in the US, and it's organized into subreddits that have particular tasks and particular structures in the way they give feedback, so they might value certain things as a community. I'll focus on the community of r/WritingPrompts in particular. This is a community of short story writers, and it's structured around writing prompts. A prompt looks like this: "A small dragon must defend his hoard, a single coin." People then provide various responses to the prompt, and each response gets some number of upvotes and downvotes, which together produce a resulting score. These scores reflect some measure of the aggregate preferences of the people on r/WritingPrompts, so we can try to learn a model of those preferences using these scores.
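To make this concrete, here's a minimal sketch of how scored responses could be turned into training comparisons. The field names (`score`, `body`) follow Reddit's conventions, but the pairing scheme itself is an illustrative assumption, not necessarily the exact one used in this project.

```python
# Sketch: build pairwise comparisons from one prompt's scored responses.
# Assumes each response is a dict with Reddit-style "score" and "body" fields.
from itertools import combinations

def comparisons_from_thread(prompt, responses):
    """Return (prompt, better, worse) triples from one prompt's responses."""
    pairs = []
    for a, b in combinations(responses, 2):
        if a["score"] == b["score"]:
            continue  # a tie carries no preference signal
        better, worse = (a, b) if a["score"] > b["score"] else (b, a)
        pairs.append((prompt, better["body"], worse["body"]))
    return pairs
```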
In particular, there are a few models I want to train. The first is the generative model. This is the model that takes a writing prompt as input and produces a response, so it's somewhat analogous to someone who's browsing the subreddit and writing. The next is the evaluative model, which would be analogous to someone who's lurking and provides an upvote or downvote on these stories. The evaluative model gets the prompt, it gets two responses to that prompt, and it simply outputs which of the two responses is better.
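This kind of pairwise judgment is commonly trained with a Bradley-Terry style loss on a scalar reward. Here's a minimal sketch, assuming `reward_model` is some network mapping a (prompt, response) pair to a scalar score; the talk doesn't specify the exact architecture, so treat this as illustrative.

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompt, better, worse):
    """Train the scalar reward to rank the preferred response higher."""
    r_better = reward_model(prompt, better)  # scalar score per example
    r_worse = reward_model(prompt, worse)
    # P(better preferred) = sigmoid(r_better - r_worse); maximize its log.
    return -F.logsigmoid(r_better - r_worse).mean()
```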
The last model in this system is the gameplay model, the agent model. This agent starts off from the generative model, so it's just something that has learned to produce stories similar to the stories it's seen, but it's further trained using the feedback it gets from the evaluative model. The evaluator essentially provides feedback to one of two agents that are playing against each other, and it provides that indication in the form of saying which one of the two stories is better. So there's a sequence of models that I train here: I start with pre-trained models, which are widely available and provide a great starting point for experiments; I train the generative model; I train the evaluative model; and I combine the generative model and the evaluative model to produce the agent.
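One simple way to realize this loop is to reinforce whichever of two sampled stories the evaluator picks. The interfaces here (`policy.sample`, `policy.log_prob`, `evaluator.prefers`) are hypothetical, since the talk doesn't pin down the exact RL algorithm; this is a sketch of the idea, not the project's implementation.

```python
def agent_training_step(policy, evaluator, prompt, optimizer):
    # Two "agents" (here, two samples from the same policy) each write a story.
    story_a = policy.sample(prompt)
    story_b = policy.sample(prompt)
    # The evaluative model says which story is better.
    winner = story_a if evaluator.prefers(prompt, story_a, story_b) else story_b
    # Reinforce the winner: raise its log-likelihood under the policy.
    loss = -policy.log_prob(prompt, winner)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```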
Ultimately, I can take those outputs and make them available to the public, to you all, for example at rewardmodeling.com, a website I built for this project, where I can gather your feedback on whether the output of this model actually matches your preferences.
The most important result is really how well this reward model generalizes, or how well it actually captures the preferences being modeled. To assess this, we train the large model on some number of comparisons and then test it on a set of comparisons it hasn't seen before. In particular, for this project I wanted to make sure the model wasn't learning anything spurious: that it wasn't learning preferences based on how long the responses were, or based on how quickly they were made. So I removed several confounders and filtered down to a hard test set of roughly similar responses that were made at roughly the same time.
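As a sketch, such a filter might look like the following. The thresholds and field names are illustrative assumptions; the point is to keep only pairs whose lengths and posting times are close, so the model can't win by exploiting length or timing alone.

```python
def is_hard_pair(resp_a, resp_b, max_len_ratio=1.2, max_time_gap_s=3600):
    """Keep only comparison pairs that rule out length and timing confounders."""
    len_a, len_b = len(resp_a["body"]), len(resp_b["body"])
    similar_length = max(len_a, len_b) <= max_len_ratio * min(len_a, len_b)
    close_in_time = abs(resp_a["created_utc"] - resp_b["created_utc"]) <= max_time_gap_s
    return similar_length and close_in_time
```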
With this, I got a final model accuracy of 74.2%.
To place this number in some context, it's worth remembering that there's some inherent noise in these preferences: they come from many people and are gathered across ten years of Reddit data, so they may vary over time. To understand this more closely, let's look at this graph of how the accuracy changes as we increase or decrease the model size and the number of examples seen. The x-axis is the number of samples in the training set, that is, the number of examples the model sees before it's tested, and the y-axis is the accuracy, its performance on the test set.
I'll draw your attention to the performance of GPT-2 XL, which is the largest model I trained. You can see that it learns the fastest of all the models, but then it essentially saturates at around 74.2 percent. There are a couple of things we can draw from this. One is that larger models simply learn faster: they can extract more meaning from the data they're given. Two, there are continued gains from increasing the number of samples the model sees, but these are most pronounced for the smaller models.
With this in mind, it's really interesting to think about what happens if we combine datasets across different subreddits, when we start to explore things like transfer, and this is where I want to take this project next. The basic idea is: what if we trained this reward model on a lot of different subreddits, a lot of different tasks, and tested its performance on a task it hadn't seen before? In many ways this more closely captures what we want out of a model in reality. For a lot of tasks we won't have the vast amount of feedback that comes from Reddit, and for a lot of tasks we won't be able to gather expensive human feedback. So what we really want is a reward model that's been trained on pre-existing signals and then performs well on an evaluation it hasn't been prepared for.
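Concretely, that evaluation could be a leave-one-subreddit-out loop like the sketch below. The helper names (`load_comparisons`, `train_reward_model`, `accuracy`) are hypothetical placeholders, not functions from this project, and only r/WritingPrompts is actually used in this talk.

```python
# Example subreddit list; only WritingPrompts appears in this project.
SUBREDDITS = ["WritingPrompts", "AskHistorians", "explainlikeimfive"]

def transfer_eval(held_out):
    """Train on every subreddit except one, then test on the held-out one."""
    train_data = [pair
                  for sub in SUBREDDITS if sub != held_out
                  for pair in load_comparisons(sub)]
    model = train_reward_model(train_data)
    return accuracy(model, load_comparisons(held_out))

# e.g. transfer_eval("WritingPrompts") measures zero-shot transfer to that task.
```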
Along these lines, you can think of a long-term direction for this field: gathering feedback from the internet is almost analogous to pre-training in the common language modeling setup, and interactive feedback is analogous to fine-tuning, in the sense that we can more carefully construct that set of feedback. We can hire contractors with particular expertise, or make sure we have a broad reflection of the feedback of the different groups of people we care about. That way we can make sure the true preferences are more accurately captured, while still getting some performance benefit from the internet feedback.
With that, I'll end on a bit of a cautionary note: what preferences have we learned? Reddit isn't really representative of the globe; it's skewed in many ways. So if we actually wanted a model that represents a very general sense of what we mean for something to be a good story, we're going to have to balance out this dataset, and that will be important going forward. We're going to want people with more expertise in writing, and we're going to want people with a lot of different influences. With that, I'd like to thank my mentor, I'd like to thank the other Scholars, especially Sam and Danielle, who donated lots of compute, and I'd like to thank the organizers for making this all possible. And now I'll answer questions.
All right. "How would you get rid of any concern of bias on Reddit?" I'll answer this one live. I think the issue is that you simply can't. The way to approach it is to balance the data with other datasets. One thing I did do is filter out explicit text, but there are still a lot of issues there. So I think the future of this is probably balancing the internet feedback with a more curated dataset of feedback.
"Do I think the 75% accuracy ceiling is due to noise in the labels or weakness of the models?" I'd say it's potentially a bit of both. I'd be very interested in seeing whether a larger model with more capacity could move past it, but it does look like models of various sizes were converging around 75% accuracy, so I actually think it's probably mostly noise in the labels.
And I did notice that when I added the data of when the response was made, the delay between the submission and the response, to the language model's input, the model was able to further increase its accuracy. It was able to take into account both the text itself and the speed of the response.
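For instance, the delay can simply be serialized into the text the model reads. The exact formatting used in the project isn't specified in the talk, so this template is an assumption.

```python
def format_input(prompt, response, delay_seconds):
    """Prepend the submission-to-response delay to the model's text input."""
    delay_hours = delay_seconds / 3600
    return (f"Delay: {delay_hours:.1f} hours\n"
            f"Prompt: {prompt}\n"
            f"Response: {response}")
```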
And with that, I think I'm at time. I'd love to answer any other questions in the future, but that's it for me.