Scaling Laws for Language Transfer Learning | Christina Kim | OpenAI Scholars Demo Day 2021 | OpenAI



Transcript

Hi everyone, I'm Christina Kim, and I'm really excited to present my Scholars project on scaling laws for language transfer learning. Throughout the OpenAI Scholars program I was interested in questions around data: what characteristics and attributes does it have, and how do those impact model performance? For my project, I looked at what the scaling laws look like for pre-trained English language models as we transfer them to other languages.

Historically, the advancement of deep learning capabilities has centered on three levers: better algorithms, faster and cheaper compute, and larger, higher-quality datasets. Given machine learning's potentially significant impact on society, deepening our general understanding of machine learning and how certain factors improve models is critical for making better predictions of which capabilities are going to develop next, and when. Further, the exploration of scaling-law evidence across these three factors has created a way to measure their impact as they interact with and limit each other. My project's framework is inspired by the work on scaling laws published by OpenAI over the past year.

Scaling laws predict machine learning performance, as I said, as a function of model size, dataset size, and the amount of compute used for training. You can think of compute, dataset size, and model size as different limiting factors that you can change to get better performance. Recently, scaling relationships were also found for transfer learning from pre-trained English text models to Python. Scaling laws for transfer are important because these relationships can help explain how to work in a limited-data regime.
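For reference, the result being referred to here, from the Scaling Laws for Neural Language Models paper, is that when the other factors are not a bottleneck, the test loss follows approximate power laws in each factor. In that paper's notation,

    L(N) \approx (N_c / N)^{\alpha_N}, \qquad L(D) \approx (D_c / D)^{\alpha_D},

where N is the non-embedding parameter count, D is the dataset size in tokens, and N_c, D_c, \alpha_N, \alpha_D are fitted constants. This is background from the cited paper, not a formula stated in the talk.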

In an ideal world you would have an infinite amount of data for your models to learn from; by that I mean you'd only be limited by the other two factors, compute and model size. But getting a large quantity of high-quality data is a non-trivial task, and it's oftentimes near impossible. As a result, most problems that we want to study are actually in this low-data regime. Before the Scholars program I was a machine learning engineer, and I saw firsthand how costly it is, in both time and money, to get good-quality data. Evaluating these trade-offs is a pretty important and practical question that many researchers and practitioners have to handle.

Building upon the work from Scaling Laws for Transfer, my experiments try to answer the question: how much does pre-training actually help when we're transferring across different languages, in this case Chinese, Spanish, and German, and what does that look like as we vary the dataset size and model size?

For my experiments, I first had to pre-train English language models. I pre-trained decoder-only Transformers ranging from 124 million non-embedding parameters down to my smallest model size, 3.3 million non-embedding parameters. I trained these on OpenWebText2, which is an open-source version of WebText, the dataset used to train GPT-2. I used the same hyperparameters as the original Scaling Laws for Neural Language Models paper, except that I used a 500-step warm-up with cosine decay to 10% of the max learning rate. The text was encoded with the same GPT-2 tokenizer, a byte-level byte-pair encoding with a vocabulary size of 50,000. All of the models were trained for about 26 billion tokens.
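To make the schedule concrete, here is a minimal sketch of that learning-rate schedule in Python, assuming a linear warm-up and a step-based cosine decay; the warm-up shape and anything beyond the 500 steps and the 10% floor stated above are my assumptions, not details from the talk:

    import math

    def lr_schedule(step, max_lr, total_steps, warmup_steps=500, final_frac=0.10):
        """500-step warm-up, then cosine decay to 10% of the max learning rate.
        Assumes a linear warm-up; the talk does not specify its exact shape."""
        if step < warmup_steps:
            return max_lr * (step + 1) / warmup_steps
        progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 down to 0
        return max_lr * (final_frac + (1.0 - final_frac) * cosine)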

As you can see here, my models exhibit scaling laws similar to what was found in Scaling Laws for Neural Language Models, except that this line isn't quite linear, which indicates that my largest models may be a bit under-trained.

After getting my pre-trained models, I set up my fine-tuning experiments. For these, I wanted to focus on changing the number of tokens and the data while holding performance, which in our case is cross-entropy loss, and model size constant. For these experiments the dataset sizes span six orders of magnitude, while the model sizes span two orders of magnitude, and I trained on three different languages: Chinese, Spanish, and German.
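As an illustration of the shape of that sweep, here is a hypothetical grid in Python; the specific token counts and the intermediate model sizes are placeholders I am making up to show the structure, not the values actually used in the project:

    # Hypothetical sweep: fine-tuning dataset sizes spanning ~6 orders of magnitude,
    # model sizes spanning ~2 orders of magnitude, for the three target languages.
    dataset_sizes = [8_000 * 10**i for i in range(7)]   # placeholder endpoints
    model_sizes = [3.3e6, 1.2e7, 4.0e7, 1.24e8]         # non-embedding params; middle two are placeholders
    languages = ["zh", "es", "de"]

    runs = [(lang, n, d) for lang in languages for n in model_sizes for d in dataset_sizes]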

For the Chinese dataset I used a dataset called CommunityQA, which is similar to the WebText corpus. For German and Spanish I used OSCAR, a multilingual corpus obtained by language classification of the Common Crawl corpus.

In my experiments, the thing I really wanted to measure was the effective data transferred: what does that look like when we're transferring from English text to Chinese, Spanish, and German text? The effective data transferred can be measured like this: this is the amount of fine-tuning data needed to get to this loss when we're using a pre-trained model, and this purple dotted line is the amount of additional data that we would need to get to that same loss when training from scratch at this dataset size. It's important to note that, as you can see, the amount of data transferred from pre-training gets smaller as we increase the number of tokens in the dataset, and eventually, for this model, it converges at a dataset size of around 10 million tokens.
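In the notation of the Scaling Laws for Transfer paper, which this measurement follows, the quantity being read off the plot is the effective data transferred,

    D_T = D_E - D_F,

where D_F is the fine-tuning data actually used with the pre-trained model and D_E is the total effective data, i.e. the amount of target-language data a same-size model trained from scratch would need to reach the same loss. The notation is from the cited paper; the talk describes this quantity only verbally.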

I wanted to show you what it looks like when we actually compare these three languages; this is the exciting bit. You can see that the pre-trained English models help the most when we're learning German, compared to Spanish and Chinese. That makes sense, because I think these results reflect a lot about the linguistic similarities between English and these other languages. English and German are both derived from Proto-Germanic and are linguistically the most similar. Although Spanish shares many of the same symbols as the English alphabet, it's actually in a different language family. And Chinese, of course, has a very different writing system from the English alphabet and is very distinct there.

Another thing I want to highlight here is the shape of the lines and the distance between them. As you can see, the effective data transferred for Spanish and Chinese is not very different at this initial point, a dataset size of 8,000 tokens. However, as we increase the dataset size, pre-training continues to help Spanish for another order of magnitude compared to Chinese.

Another way to think about how much data from pre-training is actually useful is the fraction of effective data that comes from fine-tuning: the smaller this fraction is, the more pre-training has helped us. As you can see in these graphs, as the model size increases this fraction decreases for all languages, which means pre-training has become more effective; but as we increase the dataset size, this fraction increases across model sizes, which means pre-training has become less effective.
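In the same notation as above, the fraction being discussed here is the fraction of the effective data coming from fine-tuning,

    D_F / D_E = D_F / (D_F + D_T),

so smaller values mean a larger share of the effective data came from pre-training. This formula is my restatement in the transfer paper's notation, not an equation shown in the talk.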

A lot of the results on this graph make the same points I brought up on the previous slide about how far apart these distributions might be from each other. As you can see, the German graph has steeper curves compared to the Spanish and Chinese ones, and I think that indicates there's more transfer happening for German than for the other two languages.

Another interesting thing we found is that pre-training helps most in low-data regimes. In a low-data regime, pre-training is most helpful across model sizes, but especially for the smaller model sizes. You can see here that, as I increase the model size with a fixed dataset size of Chinese text to fine-tune on, the models trained from scratch on Chinese did not improve, while the models pre-trained on English continued to achieve better performance. The flat lines here are where we're data-limited in this setup, versus where we start to see an increase in the slope, which is where we're now parameter-limited.

Another important thing to note is that using pre-trained models is far more compute-efficient than training from scratch, and you can see that here for this one model size and this one dataset size.

I want to talk about some limitations of my experiments. The first is that I used the same tokenizer for all languages. This is an issue because, as I mentioned before, the tokenizer has a 50k vocabulary, and Chinese has over 50,000 characters, so a lot of the tokenization is probably quite inefficient, which could impact model performance quite a bit. For future work, you'd want to train your own tokenizers and then do the transfer learning from there.
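As a sketch of that future-work direction, here is one way to train a language-specific byte-level BPE tokenizer with the Hugging Face tokenizers library; the corpus path, output directory, and vocabulary size are placeholders, and this is not the setup used in the project:

    from tokenizers import ByteLevelBPETokenizer

    # Train a byte-level BPE tokenizer on target-language text (hypothetical corpus file).
    tokenizer = ByteLevelBPETokenizer()
    tokenizer.train(
        files=["chinese_corpus.txt"],     # placeholder path
        vocab_size=50_000,                # matching the GPT-2 vocab size mentioned above
        min_frequency=2,
        special_tokens=["<|endoftext|>"],
    )
    tokenizer.save_model("zh_tokenizer")  # writes vocab.json and merges.txt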

Another point is that, judging from my original pre-training plots, I could probably have pre-trained for longer; then I think I would have gotten a more linear fit for some of the scaling laws I saw for the OpenWebText2 models. I would also want to do a more thorough hyperparameter and learning-rate sweep, as I believe both of these limitations could lead to very different results; the numbers I showed in the previous slides might look quite different had I found the optimal learning rates for the different dataset sizes and model sizes.

One other note is that the languages in my data come from different sources, so this experiment could be more thorough if I had used the same dataset source for all three languages.

I want to talk about some future work that I'm really excited about after this project. One thing that could be really interesting is to compare the effective data transferred when we use pre-trained models of a different language and transfer back to English. Then you could maybe create some kind of mapping of how far apart distributions are from each other: is there some kind of symmetry in the data transfer, and what does that actually look like? Another obvious next step would be to use this setup to work on low-resource languages, or on other tasks and distributions that are quite different from English.

Another thing that would be very cool to do based on this work would be to predict the ideal ratio of pre-training to fine-tuning for any given problem, for whatever compute budget you have. Something else that I think would be interesting, in the same experimental format, would be studying the forgetting problem in transfer learning and seeing what the effective data transferred looks like as we approach that problem.

Before I answer questions, I want to give some thanks. I want to thank JT for sharing his wisdom with me throughout the program, for keeping our project on track, and for staying up late in Poland to hear this; my fellow Scholars, especially Danielle and Kudzo, for sharing compute with me; and everyone that gave me feedback throughout the process and the program, especially Danny. A shout-out to OpenAI for making all of this possible.

Great, so now I'll answer some of the questions I have here. I have a question that asks which model architecture was used for transfer learning across models, and also which one was trained from scratch. The model architecture I used is the same GPT-style architecture, which is a decoder-only Transformer.


I have a question that asks how I would extrapolate what kinds of gains from pre-training you'd get with models smaller or larger than the ones I trained, or with smaller or larger datasets. I think you would see trends similar to the ones in my previous slides for the different dataset sizes. The main takeaway is that if you have a large fine-tuning dataset, you're not going to get as much gain as you would from a much smaller fine-tuning dataset.


Another question is how my setup relates to the Scaling Laws for Transfer paper by Danny Hernandez from earlier this year. A lot of my work is very much inspired by Danny's experiments: I did the same type of experimentation, changing the dataset size while varying the model sizes and comparing the loss between those.

This next question asks: did you consider transfer between other types of languages, say programming languages? I would actually say you should check out the Scaling Laws for Transfer paper, because it looks at exactly that, how English transfers to Python.


I got another question asking whether I had a chance to study performance on metrics other than loss. I didn't, but I'd be curious to see how you could characterize this on downstream tasks; I think that's a pretty big thing to look at for transfer learning in particular.

There's a question asking whether I'd like to use a different tokenizer in the future. Yes, definitely; I think using tokenizers trained on the specific languages would get you much better results and, therefore, probably much cleaner graphs.

Then there's a question asking whether there was any reason I decided not to train models smaller than two million parameters. Not particularly; I just thought that models much smaller than that would give losses that weren't very interesting to look at, since they would become parameter-limited very quickly.

Awesome, I think that's all my time, so I'm going to pass it off to Danielle, who's going to be presenting her project.