Transcript
Hi everyone, I'm Christina Kim, and I'm really excited to present my Scholars project on scaling laws for language transfer learning.
Throughout the OpenAI Scholars program I was interested in questions around data: what characteristics and attributes it has, and how those impact model performance. So for my project, I looked at what the scaling laws look like for pre-trained English language models as we transfer to other languages.
Historically, the advancement of deep learning capabilities has been driven by three levers: better algorithms, faster and cheaper compute, and larger, higher-quality datasets. Given machine learning's potentially significant impact on society, deepening our general understanding of machine learning and of how certain factors improve models is critical for making better predictions about which capabilities will develop next, and when. Further, the evidence of scaling laws across these three factors has created a way to measure their impact as they interact with and limit each other.
My project's framework is inspired by the work on scaling laws published by OpenAI over the past year. Scaling laws predict machine learning performance as a function of model size, dataset size, and the amount of compute used for training. You can think of compute, dataset size, and model size as different limiting factors that you can change to get better performance. Recently, scaling relationships were also found for transfer learning from pre-trained English text models to Python.
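As a rough sketch of the form these laws take, following the Scaling Laws for Neural Language Models paper (the notation below is theirs; the constants are fit empirically and not something I report here):

```latex
% L = cross-entropy loss, N = non-embedding parameters, D = dataset size in tokens.
% N_c, D_c, \alpha_N, \alpha_D are empirically fit constants.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}
```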
Scaling laws for transfer are important because these scaling relationships can help explain how to work in a limited-data regime. In an ideal world, you would have an infinite amount of data for your models to learn from; by that I mean you would only be limited by the other two factors, compute and model size. But getting a large quantity of high-quality data is a non-trivial task, and it's often near impossible.
As a result, most problems we want to study are actually in this low-data regime. Before the Scholars program I was a machine learning engineer, and I saw firsthand how costly it is, in both time and money, to get good-quality data. Evaluating these trade-offs is an important and practical question that many researchers and practitioners have to handle.
Building on the work on scaling laws for transfer, my experiments try to answer the question: how much does pre-training actually help when we transfer to different languages, namely Chinese, Spanish, and German, and what does that look like as we vary the dataset size and model size?
For my experiments I first had to pre-train English language models. I pre-trained decoder-only transformers ranging from 124 million non-embedding parameters down to my smallest model size of 3.3 million non-embedding parameters. I trained these all on OpenWebText2, an open-source version of the WebText dataset that was used to train GPT-2. I used the same hyperparameters as the original Scaling Laws for Neural Language Models paper, except that I used a 500-step warmup with a cosine decay to 10% of the max learning rate. The text was encoded with the same GPT-2 tokenizer, which is a byte-level byte-pair encoding with a vocab size of 50,000. All of the models were trained for about 26 billion tokens.
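As a minimal sketch of what that learning-rate schedule could look like (the 500-step warmup and the 10% floor come from the talk; the linear warmup shape and the function itself are my assumptions for illustration):

```python
import math

def lr_schedule(step, max_lr, total_steps, warmup_steps=500, final_frac=0.10):
    """Linear warmup for `warmup_steps`, then cosine decay from max_lr
    down to final_frac * max_lr over the remaining steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return final_frac * max_lr + (max_lr - final_frac * max_lr) * cosine
```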
As you can see here, my models exhibit scaling laws similar to what was found in Scaling Laws for Neural Language Models, except that this line isn't quite linear, which indicates that my largest models may be a bit under-trained.
After getting my pre-trained models, I next set up my fine-tuning experiments. For the fine-tuning experiments I wanted to focus on changing the number of tokens of data while holding performance, which in our case was cross-entropy loss, and model size constant. For these experiments the dataset sizes spanned six orders of magnitude, while the model sizes spanned two orders of magnitude, and I trained on three different languages: Chinese, Spanish, and German. For the Chinese dataset I used a dataset called CommunityQA, which is similar to the WebText corpus. For German and Spanish I used OSCAR, a multilingual corpus obtained by classifying the Common Crawl corpus.
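For a sense of the sweep's shape, here is a minimal sketch of the experimental grid; the specific endpoints and model sizes below are illustrative assumptions, not the exact values from my runs:

```python
import numpy as np

languages = ["chinese", "spanish", "german"]
model_sizes = [3.3e6, 124e6]  # non-embedding parameters, spanning two orders of magnitude
dataset_sizes = np.logspace(4, 10, num=7).astype(int)  # fine-tuning tokens, six orders of magnitude

runs = [(lang, n_params, n_tokens)
        for lang in languages
        for n_params in model_sizes
        for n_tokens in dataset_sizes]
```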
In my experiments, the thing I really wanted to measure was the effective data transferred: what does that look like when we transfer from English text to Chinese, Spanish, and German text? The effective data transferred can be measured like this: this is the amount of fine-tuning data needed to get to this loss when we're using a pre-trained model, and this purple dotted line is the amount of additional data we would need to get to that same loss when training from scratch at this dataset size. It's important to note that, as you can see, the amount of data transferred from pre-training gets smaller as we increase the number of tokens in the fine-tuning dataset, and for this model it eventually converges at around 10 million tokens.
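To make that measurement concrete, here is a minimal sketch of how the effective data transferred could be computed, following the definition in the Scaling Laws for Transfer paper; the function name and interpolation details are my own assumptions:

```python
import numpy as np

def effective_data_transferred(from_scratch_curve, finetune_tokens, finetuned_loss):
    """D_T: the extra tokens a from-scratch model would need to match the
    loss reached by the pre-trained model after fine-tuning.

    from_scratch_curve: (tokens, loss) pairs for a model trained from scratch
        on the target language, sorted by increasing tokens.
    finetune_tokens (D_F): target-language tokens used for fine-tuning.
    finetuned_loss: loss reached by the pre-trained model after fine-tuning.
    """
    tokens, losses = map(np.asarray, zip(*from_scratch_curve))
    # Loss decreases as data grows, so reverse both arrays to make loss
    # increasing, then interpolate tokens as a function of loss.
    d_effective = np.interp(finetuned_loss, losses[::-1], tokens[::-1])  # D_E
    return d_effective - finetune_tokens  # D_T = D_E - D_F
```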
I wanted to show you what it looks like when we actually compare these three languages; this is the exciting bit. You can see that the pre-trained English models help the most when we're learning German, versus Spanish and Chinese. That makes sense, because I think these results reflect a lot about the linguistic similarities between English and these other languages. English and German are both derived from Proto-Germanic and are linguistically the most similar. Although Spanish shares many of the same symbols as the English alphabet, it's actually in a different family of languages. And Chinese has a very different writing system from English and is the most distinct of the three.
Another thing I want to highlight is the shape of the lines and the distance between them. As you can see, the effective data transferred for Spanish and Chinese is not too different at this initial point, at a dataset size of 8,000 tokens. However, as we increase the dataset size, pre-training continues to help Spanish for another order of magnitude compared to Chinese.
Another way to think about how much data from pre-training is actually useful is the fraction of effective data from fine-tuning: the smaller this fraction is, the more pre-training has helped us. As you can see in these graphs, as the model size increases this fraction decreases across all languages, which means pre-training has become more effective. But as we increase the dataset size, this fraction increases across model sizes, which means pre-training has become less effective.
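As a quick sketch of that quantity, using the hypothetical D_T helper from the earlier snippet (the naming here is my own):

```python
def finetune_fraction(finetune_tokens, data_transferred):
    """Fraction of effective data that came from fine-tuning: D_F / D_E,
    where D_E = D_F + D_T. A smaller value means pre-training supplied
    a larger share of the effective data."""
    return finetune_tokens / (finetune_tokens + data_transferred)
```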
A lot of the results on this graph show the same points I brought up on the previous slide about how far apart these distributions may be from each other. As you can see, the German curves here are steeper than the Spanish and Chinese ones, and I think that indicates there is more transfer happening for German than for the other two languages.
Another interesting thing we found was that pre-training helps most in low-data regimes. In a low-data regime, pre-training is helpful across model sizes, but especially for the smaller models. You can see here that, as I increase the model size with a fixed dataset size of Chinese text to fine-tune on, the models trained from scratch on Chinese did not improve, while the models pre-trained on English continued to achieve better performance. The flat lines here are where we're data-limited in this setup, versus where the slope starts to increase, which is where we're now parameter-limited.
Another important thing to note is that using pre-trained models is far more compute-efficient than training from scratch, and you can see that here for this one model size and this one dataset size.
I want to talk about some limitations of my experiments. The first is that I used the same tokenizer for all languages. This is an issue because, as I mentioned before, the tokenizer has a 50k vocab size and Chinese has over 50,000 characters, so a lot of the tokenization is probably quite inefficient. This could impact model performance quite a bit, so for future work you'd want to train your own tokenizers and then do the transfer learning from there.
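As one way to do that, here is a minimal sketch of training a per-language byte-level BPE tokenizer with the Hugging Face tokenizers library; the file path and vocab size are placeholder assumptions:

```python
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on a single target language,
# instead of reusing the English GPT-2 vocabulary.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["chinese_corpus.txt"],   # hypothetical per-language training file
    vocab_size=50_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)
tokenizer.save_model("tokenizer-zh")  # writes vocab.json and merges.txt
```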
Another point is that, judging from my original pre-training plots, I could probably have pre-trained for longer; then I think I would have gotten a more linear fit for some of the scaling laws I saw for the OpenWebText2 models. I would also want to do a more thorough hyperparameter and learning-rate sweep, as I believe both of these limitations could lead to very different results, and the numbers I showed on the previous slides might look quite different had I found the optimal learning rates for the different dataset sizes and model sizes.
One other note is that the language datasets I used come from different sources, so this experiment could be more thorough if I had used the same dataset source for all three languages.
I also want to talk about some future work that I'm excited about after this project. One thing that could be really interesting is to compare the effective data transferred when using pre-trained models of a different language and transferring back to English. Then you could maybe create some kind of mapping of how far apart the distributions are from each other: is there some kind of symmetry in the data transfer, and what does that actually look like?
Another obvious next step would be to use this setup to do work on low-resource languages, or on other tasks and distributions that are quite different from English. It would also be very cool, based on this work, to predict the ideal ratio of pre-training to fine-tuning for any given problem, for whatever compute budget you have.
Another thing I think would be interesting, in the same experimental format, would be studying the forgetting problem in transfer learning and seeing what the effective data transferred looks like as we approach that problem.
Before I answer questions, I want to give some thanks. I want to thank JT for sharing his wisdom with me throughout the program, for keeping our project on track, and for staying up late in Poland to hear this. Thanks to my fellow scholars, especially Danielle and Kudzo, for sharing compute with me, and to everyone who gave me feedback throughout the process and the program, especially Danny. And a shout-out to OpenAI for making all of this possible.
Great, so now I'll answer some of the questions I have here. I have a question asking which model architecture was used for the transfer learning across models, and which one was trained from scratch. The architecture I used in both cases is the same GPT-style transformer, which is a decoder-only transformer.
I have a question asking how I would extrapolate what kinds of gains from pre-training you'd get from models smaller or larger than the ones I trained, or from smaller and larger datasets. I think you would see trends similar to the ones in my earlier slides for the different dataset sizes, and the main takeaway I saw is that with a large fine-tuning dataset you're not going to get as many gains from pre-training as you would with a much smaller fine-tuning dataset.
Another question is how my setup relates to the Scaling Laws for Transfer paper by Danny Hernandez from earlier this year. A lot of my work is very much inspired by Danny's experiments: I did the same type of experimentation, changing the dataset size as I varied the model sizes and comparing the loss across those runs.
This next question asks whether I considered transfer between other types of languages, say programming languages. I would actually say you should check out the Scaling Laws for Transfer paper, because it looks at how English transfers to Python.
I got another question asking whether I had a chance to study performance on metrics other than loss. I didn't, but I'd be curious to see how you could characterize this on downstream tasks; I think that's a pretty big thing to look at for transfer learning in particular.
There's a question asking whether I would like to use a different tokenizer in the future. Yes, definitely. I think using tokenizers trained on the specific languages would get you much better results, and therefore probably much cleaner graphs.
And then there's a question asking whether there was any reason I decided not to train models smaller than two million parameters. Not particularly; I just thought models much smaller than that would give losses that weren't very interesting to look at, since they would become parameter-limited very quickly.
Awesome, I think that's all my time, so I'm going to pass it off to Danielle, who will be presenting her project.