Large Scale Reward Modeling | Jonathan Ward | OpenAI Scholars Demo Day 2021 | OpenAI



Transcript

Hey everyone, I'm Jonathan Ward, and I've been mentored by John Schulman over the past six months. In that time I've explored the possibility of large-scale reward modeling. What this means in practice is: how do we learn what people want, and then build models that are better able to do that?

This question really starts with: what should models do? There are a couple of domains we can think about. One is the formal domain, in which specifying the task is very clear and very simple. This includes game playing: board games and video games. A lot of the recent results in machine learning around beating Atari and various other video games, or chess, or Go, have centered on these domains, where it's easy to provide clear feedback to the model about what to do next. But a lot of life is actually much more informal, where specifying what correct behavior looks like is much harder to do, and that's the area I'm going to focus on today.

There are a couple of ways to proceed, and I'll contrast two in particular. One is the idea of formalizing what's informal: trying to write a function that somehow captures the nuances of the problem at hand. This would include things like ROUGE or BLEU, if you're familiar with those terms in machine learning, which are functions that essentially try to measure how good a summary or a translation is. But there's another approach, which is what I'll focus on, that aims to understand what's good by simply asking people to compare two things or to rate something. This is the setting of learning human preferences.

The tricky part is actually getting these preferences, getting this feedback, and there are a couple of ways to proceed. A lot of excellent work has been done at OpenAI, and at DeepMind as well, on understanding how to use feedback and how to train systems to incorporate that feedback and improve their performance. This prior work has really focused on interactive feedback: feedback where you hire contractors, and the researchers work with those contractors to make sure there's a common understanding. Past work has demonstrated that you can build accurate models of human preferences using this type of feedback. It is, however, expensive.

One potential way forward is to use feedback that's directly available on the internet. This is potentially much less expensive, you can potentially gather more of this data, and you can gather it across various tasks. That's what I'm really going to focus on. The operative question is essentially: can we train an accurate model of human preferences from feedback that's gathered on the internet?

Just to reflect on that a little more: with interactive feedback you get some benefits. You can make sure that the contractors, or whoever is providing the feedback, have a sense of preferences that closely matches yours, the researchers', or whatever gold standard you hold for the true preferences. With internet feedback, you kind of get what's already out there: if there are a lot of ratings of one thing being better than another, of better stories or better answers to questions, then that model of what exists and what's preferred is what you're capable of learning from.

With this in mind, I really wanted to focus on more structured, task-oriented feedback. A lot of feedback on the internet is generic and scattered across various tasks: a like on Twitter or YouTube isn't really responding to performance on a specific task, just indicating that the comment or the video was good. But in task-oriented domains you can actually get a clear answer about whether a certain explanation, or a certain answer to a question, was good, so there's a very clear sense of input and output.

In particular, I'll be focusing on Reddit. Reddit is the seventh most popular site in the US, and it's organized into subreddits that have particular tasks and particular structures in the way they give feedback; each community might value certain things. I'll focus on the community r/WritingPrompts in particular. This is a community of short-story writers, and it's structured around writing prompts. If you're trying to respond to one of these things, a prompt looks like this: "A small dragon must defend his hoard: a single coin." People then provide various responses to that writing prompt, each response gets some number of upvotes and downvotes, and together those produce a resulting score. These scores reflect some measure of the aggregate preferences of the people on r/WritingPrompts.
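As a rough illustration of how those scores can be turned into preference data, here is a minimal sketch that pairs up responses to the same prompt and labels the higher-scoring one as preferred. The field names and the score-gap threshold are assumptions for illustration, not the project's exact pipeline.

```python
from itertools import combinations

def build_comparisons(responses, min_score_gap=2):
    """Turn scored responses to one prompt into pairwise preference examples.

    `responses` is a list of dicts with hypothetical keys:
    {"prompt": str, "text": str, "score": int}.
    """
    comparisons = []
    for a, b in combinations(responses, 2):
        # Skip near-ties: small score gaps are mostly noise.
        if abs(a["score"] - b["score"]) < min_score_gap:
            continue
        preferred, other = (a, b) if a["score"] > b["score"] else (b, a)
        comparisons.append({
            "prompt": a["prompt"],
            "chosen": preferred["text"],
            "rejected": other["text"],
        })
    return comparisons
```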

We can try to learn a model of those preferences using these scores, and there are a few models in particular that I want to train. The first is the generative model: it takes a writing prompt as input and produces a response, which is somewhat analogous to someone who's browsing the subreddit and writing. The next is the evaluative model, which is analogous to someone who's lurking and providing upvotes or downvotes on these stories. The evaluative model gets the prompt, gets two responses to that prompt, and simply outputs which of the two responses is better.
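To make that concrete, here is a minimal sketch of the kind of pairwise (Bradley-Terry style) loss commonly used to train such a comparison model, assuming a GPT-2 with a single scalar score head. The model choice, input format, and truncation length are illustrative assumptions, not the exact code behind this project.

```python
import torch.nn.functional as F
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

# Hypothetical reward model: GPT-2 with a single-logit head that scores a response.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=1)
reward_model.config.pad_token_id = tokenizer.pad_token_id

def score(prompt, response):
    """Scalar preference score for a (prompt, response) pair."""
    inputs = tokenizer(prompt + "\n\n" + response, return_tensors="pt",
                       truncation=True, max_length=512)
    return reward_model(**inputs).logits.squeeze(-1)

def preference_loss(prompt, chosen, rejected):
    """Logistic loss that pushes the chosen response's score above the rejected one's."""
    return -F.logsigmoid(score(prompt, chosen) - score(prompt, rejected)).mean()
```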

The last model in this system is the agent model. The agent starts off from the generative model, so at first it just learns to produce stories similar to the stories it has seen, but it's further trained using the feedback it gets from the evaluative model. The evaluator essentially provides feedback to one of two agents that are playing against each other, and it provides that feedback in the form of an indication of which of the two stories is better.
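Here is a rough sketch of what a single round of that game could look like: two GPT-2 agents each write a story for the prompt and a scalar-scoring evaluator picks the winner. The models, decoding settings, and input format are all assumptions for illustration, and the RL update that would consume the win/loss signal is not shown.

```python
import torch
from transformers import (GPT2Tokenizer, GPT2LMHeadModel,
                          GPT2ForSequenceClassification)

# Hypothetical setup: two copies of the generative model act as the competing agents,
# and a GPT-2 with a scalar head stands in for the trained evaluative model.
tok = GPT2Tokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
agent_a = GPT2LMHeadModel.from_pretrained("gpt2")
agent_b = GPT2LMHeadModel.from_pretrained("gpt2")
evaluator = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=1)
evaluator.config.pad_token_id = tok.pad_token_id

def evaluator_score(prompt, story):
    """Scalar preference score the evaluative model assigns to one story."""
    inputs = tok(prompt + "\n\n" + story, return_tensors="pt",
                 truncation=True, max_length=512)
    with torch.no_grad():
        return evaluator(**inputs).logits.item()

def play_round(prompt, max_new_tokens=100):
    """One round: each agent writes a story, the evaluator says which is better."""
    stories = []
    for agent in (agent_a, agent_b):
        inputs = tok(prompt, return_tensors="pt")
        out = agent.generate(**inputs, do_sample=True, top_p=0.9,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tok.eos_token_id)
        # Keep only the newly generated continuation, not the echoed prompt.
        stories.append(tok.decode(out[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True))
    scores = [evaluator_score(prompt, s) for s in stories]
    winner = int(scores[1] > scores[0])  # 0 -> agent_a won, 1 -> agent_b won
    # In training, this win/loss signal is the reward used to update the agents.
    return winner, stories, scores
```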

So there's a sequence of models that I train here. I start with pretrained models, which are widely available and provide a great starting point for experiments. I train the generative model, I train the evaluative model, and I combine the generative model and the evaluative model to produce the agent. Ultimately I can take those outputs and make them available to the public, to you all, for example at rewardmodeling.com, a website I built for this project, where I can actually gather some of your feedback on whether the output of this model matches your preferences.

The most important result is really how well this reward model generalizes, or how well it actually captures the preferences we're modeling. To assess this, we train the large model on some number of comparisons and then test it on a set of comparisons it hasn't seen before.
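As a minimal sketch of what that held-out evaluation looks like, assuming the comparison dicts and scalar scoring function sketched above (an illustration, not the project's exact evaluation code):

```python
def comparison_accuracy(score_fn, comparisons):
    """Fraction of held-out comparisons where the model ranks the community-preferred
    response above the other one.

    `score_fn(prompt, response)` returns a scalar preference score, and `comparisons`
    holds {"prompt", "chosen", "rejected"} dicts as in the earlier sketches.
    """
    correct = 0
    for c in comparisons:
        if float(score_fn(c["prompt"], c["chosen"])) > float(score_fn(c["prompt"], c["rejected"])):
            correct += 1
    return correct / max(len(comparisons), 1)
```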

In particular, for this project I wanted to make sure the model wasn't learning anything spurious: that it wasn't learning preferences based on how long the responses were, and wasn't learning them based on how quickly the responses were made. So I removed several confounders and filtered down to a hard test set of roughly similar responses that were made at roughly the same time. With this, the model reached a final accuracy of 74.2 percent.
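As a rough sketch of that kind of confounder filtering, with field names and thresholds that are purely illustrative assumptions rather than the exact criteria used:

```python
def hard_test_filter(comparisons, max_len_ratio=1.2, max_time_gap_s=3600):
    """Keep only comparisons whose two responses are roughly similar in length and
    were posted at roughly the same time, so neither cue predicts the label.

    Each comparison is a dict with hypothetical keys:
    {"chosen": str, "rejected": str, "chosen_time": float, "rejected_time": float}.
    """
    kept = []
    for c in comparisons:
        len_a, len_b = len(c["chosen"]), len(c["rejected"])
        similar_length = max(len_a, len_b) / max(min(len_a, len_b), 1) <= max_len_ratio
        similar_timing = abs(c["chosen_time"] - c["rejected_time"]) <= max_time_gap_s
        if similar_length and similar_timing:
            kept.append(c)
    return kept
```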

To place this number in some context, it's worth remembering that there's some inherent noise in these preferences: they come from a lot of people, and they're gathered across ten years of Reddit data, so they might vary over time. To understand this more closely, let's take a look at this graph, which shows how the accuracy changes as we increase or decrease the model size and as we increase or decrease the number of examples the model has seen. On the x-axis is the number of samples in the training set, that is, the number of examples the model sees before it's tested; on the left is the accuracy, its performance on the test set. I'll draw your attention to the performance of GPT-2 XL, which is the largest model that I trained. You can see that it actually learns the fastest of all the models, but then it essentially saturates at a performance of around 74.2 percent.

There are a couple of things we can draw from here. One is that larger models simply learn faster: they can extract more meaning from the data that's given to them. Two, there are continued gains from increasing the number of samples the model sees, but this is most pronounced for the smaller models. With this in mind, it's really interesting to think about what happens if we combine datasets across different subreddits, and what happens when we start to explore things like transfer. This is really where I want to take this project next.

The basic idea is: what if we trained this reward model on a lot of different subreddits, a lot of different tasks, and tested its performance on a task it hadn't seen before? In many ways this more closely captures what we want out of a model in reality. For a lot of tasks we won't have the vast amount of feedback that comes from Reddit, and for a lot of tasks we won't be able to gather expensive human feedback. What we really want is a reward model that's been trained on pre-existing signals and then performs well on some evaluation it hasn't been prepared for.

Along these lines, you can think of a long-term direction for this field in which gathering feedback from the internet is almost analogous to pre-training in the common language-modeling setup, and the interactive feedback that people gather is analogous to fine-tuning. In that stage we can more carefully construct the set of feedback: we can hire contractors with particular expertise, or we can make sure we have a broad reflection of the feedback from the different groups of people we care about. That way we can make sure the true preferences are more accurately captured, while still getting some performance benefit from using the internet feedback. With that, I'll end on a bit of a cautionary note: what preferences have we actually learned?

Reddit isn't really representative of the globe; it's skewed in many ways. So if we actually wanted a model that represents a very general sense of what we mean for something to be a good story, we're going to have to balance out this dataset, and that will be important going forward: we're going to want people with more expertise in writing, and we're going to want people with a lot of different influences. With that, I'd like to thank my mentor, I'd like to thank the other Scholars, especially Sam and Danielle, who donated lots of compute, and I'd like to thank the organizers for making this all possible. Now I'll answer questions.

All right. How would I get rid of any concern of bias on Reddit? I'll answer this one live. I think the issue is that you simply can't. I think the way to approach it is to balance it with other datasets. One thing I did do is filter out explicit text, but there are still a lot of issues there. So I think the future of this is probably balancing the internet feedback with a more curated set of feedback.

Do I think the 75 percent accuracy ceiling is due to noise in the labels or to weakness of the models? I'd say it's potentially a bit of both. I would be very interested in seeing whether a larger model with more capacity can move past it, but it does look like the various model sizes were all converging around 75 percent accuracy, so I actually think it's probably mostly noise in the labels. I did notice that when I added the delay between the submission and the response to the language model input, the model was able to further increase its accuracy: it could take into account both the text itself and the speed of the response.
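The talk doesn't specify how that delay was encoded; as a purely hypothetical sketch of one way to fold it into the model's input text:

```python
def format_with_delay(prompt, response, submission_time, response_time):
    """Prepend the response delay to the text the reward model scores.

    The template and argument names are assumptions; the talk only says the delay
    between submission and response was added to the language-model input.
    """
    delay_minutes = int((response_time - submission_time) / 60)
    return f"{prompt}\n\n[posted {delay_minutes} minutes after the prompt]\n\n{response}"
```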

And with that, I think I'm at time. I'd love to answer any other questions in the future, but that's it for me.