Studying Scaling Laws for Transformer Architecture … | Shola Oyedele | OpenAI Scholars Demo Day 2021 | OpenAI



Transcript

So I'll get started. Hey everyone, my name is Shola, and I'm going to be talking about scaling laws for transformer architecture variants.

Today's talk is broken down into three main sections: first, a discussion of the problem and the context behind pursuing this research; then an overview of the research and the findings from the experiments; and finally an overview of future opportunities for research on this topic.

First we're going to dive into the problem statement behind this research project. Specifically, I want to talk about research that has already been done in the space, to provide context for this project.

Scaling Laws for Neural Language Models was a recent paper that came out of OpenAI. I imagine most people in this room are familiar with this research, so I'll try to keep my comments brief. The paper introduced empirical scaling laws for the performance of language models as a function of a few key attributes: model size, which is defined by the number of model parameters; dataset size; and compute, referring to the total compute budget used to train the model. As defined in the paper, compute means the non-embedding compute used during training, and can be estimated as C = 6NBS, where N is the number of non-embedding parameters, B is the batch size, S is the number of parameter updates, and the factor of 6 accounts for the forward and backward passes.
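
Just to make that estimate concrete, here is a minimal sketch (my own illustration, not code from the paper or from my experiments), assuming N is the non-embedding parameter count, B is the batch size in tokens per step, and S is the number of optimizer steps:

```python
def estimate_compute(n_params: float, batch_tokens: float, steps: float) -> float:
    """Rough training-compute estimate C = 6 * N * B * S (in FLOPs).

    The factor 6 covers roughly 2 FLOPs per parameter per token for the
    forward pass plus about 4 for the backward pass.
    """
    return 6.0 * n_params * batch_tokens * steps

# e.g. a ~350M-parameter model, 0.5M tokens per batch, 100k parameter updates
print(f"{estimate_compute(350e6, 5e5, 1e5):.2e} FLOPs")
```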

The graph to the left is from the original paper and illustrates how language modeling performance improves smoothly as we increase model size, dataset size, and compute, with optimal performance gained when all three are scaled in tandem.

Given that L(C) is its own function for every model, I considered fits that were within five percent of each other to be within a margin of uncertainty, so the further apart the L(C) curves were, the more significant I considered the results. L(C) can be calculated with an irreducible loss term or without, but in my experiments I calculated it exclusively without, for simplicity. It should be noted that this is L(C), the equation that represents this power law here, and this constant term here controls the trade-off between loss and compute. If you see me looking off to my right, I'm looking at my monitor to make sure I'm pointing at the right thing.
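
As a rough sketch of the kind of compute-efficient frontier fit I mean, without an irreducible-loss term as in my experiments (the data points here are hypothetical, purely for illustration):

```python
import numpy as np

def fit_power_law(compute, loss):
    """Fit L(C) = a * C**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return np.exp(intercept), -slope  # (a, b)

# hypothetical frontier points: compute in FLOPs, loss in nats per token
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = np.array([4.2, 3.6, 3.1, 2.7])
a, b = fit_power_law(compute, loss)
print(f"L(C) ~ {a:.3g} * C^(-{b:.3f})")
```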

So that's the existing research that led to my project: while the original paper studied scaling laws on decoder-only transformers, I wanted to understand how these laws trend across different transformer architectures.

Before we get into more background, let's talk about motivation. I was interested in working with transformers because of their impact on the NLP space. When you add scaling laws, which introduce the ability to forecast loss with respect to compute, I was curious how these laws might generalize across the different types of transformer architectures. I was also curious about the constraints of scaling laws, and I thought that trying to reproduce scaling laws on different types of transformers could tell us more about what generalizes across the different algorithms, versus being a standalone feature of the original decoder-only transformer. At best, it would be an opportunity to understand the connection between transformer architecture components and model performance; at worst, it would be an opportunity to understand the constraints of scaling laws, particularly when exploring more models.

Next I'll discuss the types of architectures I experimented on, why I chose them, their differences, and what I think we can expect from them. The transformer architectures I studied fall into two categories: causal language modeling, which is predicting the next token in the sequence, and masked language modeling, which refers to predicting a masked word that may be anywhere within the sentence. The only difference between the two is the way the model is trained, so the same architecture can be used for both types of language modeling. You'll see later in the presentation that I did this with BERT: although BERT was originally released as a masked language model, there is a causal implementation of it, and I ran my experiments on both.
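
To make that concrete, here is a small sketch of the idea, assuming the Hugging Face transformers library; the class names are the standard BERT ones, but this is only an illustration of the point, not my exact training setup:

```python
from transformers import BertConfig, BertForMaskedLM, BertLMHeadModel

# one architecture definition...
mlm_config = BertConfig()
# ...used as a masked language model (bidirectional context, predict [MASK] tokens)
mlm_model = BertForMaskedLM(mlm_config)

# ...and the same architecture used as a causal language model
# (is_decoder=True restricts attention to tokens on the left)
clm_config = BertConfig(is_decoder=True)
clm_model = BertLMHeadModel(clm_config)
```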

A little bit of background, and I'll try to speed through this because I know it's a lot of information. The architectures I experimented with are shown here; they were picked based on their architecture and ease of implementation, so in the interest of time I'll just highlight some of the more important characteristics. I mentioned earlier that the original scaling laws paper studied only the decoder-only transformer. Since the dataset, dataset size, and machine that I used are all different from the original paper, I decided to experiment on a transformer similar to the one used there, and for that I used GPT-2, as it was the closest to what was in the original paper and I thought it would be a good reference point to put the others in context. As you'll see, the other two causal language models I experimented with are Transformer-XL and Reformer. You'll also see that BERT, like I mentioned, was experimented on using both its masked implementation and its causal implementation. BERT is a masked language model that uses random masking and next-sentence prediction. The only difference between the causal and masked implementations is the way the model is trained, meaning the same architecture can be used for both types of modeling; I really want to point that out because it becomes important later in the presentation. And then the last two masked language models are both similar to and inspired by BERT; I'll skip those in the interest of time.

My hypothesis is that the impact of a transformer architecture on scaling laws depends on how significantly that architecture impacts model size, dataset size, and, most likely, compute. Given Reformer's focus on reducing memory and compute during training, which is embedded in its architecture, my prediction was that the Reformer architecture in particular would outperform the other models. Given the insight from the original paper on how weakly performance trends with model shape, I predicted that BERT's scaling laws would show little to no change between the causal and masked versions, and that all of the masked language models would have scaling laws within a margin of uncertainty of each other, differing only by a constant prefactor.

Next I want to talk about the experiments: methodology, preliminary findings, and research implications. I'm going to try to speak a little faster because I think I'm running a little behind on time.

For methodology, one of the key tasks of my research project was calculating the compute-efficient frontier fit, that's L(C), on transformer architectures doing language modeling, essentially trying to understand how L(C) changes with respect to algorithmic changes within an architectural family. Once I decided which models to study, I proposed several model sizes and then produced a learning curve for each run, ideally training at least four to five sizes for every model, with the number of parameters ranging between 2 million and 350 million, with some variation between the different architectures. This graph highlights a typical loss versus estimated compute plot, and here I just want to point out that we see the formation of a Pareto frontier, and the line adjacent to this curve is the scaling law that I'm going to be looking for and evaluating between the models.
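
Here is a small sketch of how such a frontier could be pulled out of the pooled learning-curve points; the (compute, loss) pairs are hypothetical, and this illustrates the idea rather than the exact procedure I used:

```python
def pareto_frontier(points):
    """Keep the (compute, loss) points that no cheaper run beats on loss,
    i.e. the compute-efficient frontier."""
    frontier, best_loss = [], float("inf")
    for c, l in sorted(points):      # sweep runs in order of increasing compute
        if l < best_loss:            # strictly better than anything cheaper
            frontier.append((c, l))
            best_loss = l
    return frontier

# hypothetical learning-curve points pooled across model sizes
points = [(1e17, 4.5), (2e17, 4.1), (5e17, 4.2), (1e18, 3.7), (2e18, 3.8)]
print(pareto_frontier(points))       # [(1e17, 4.5), (2e17, 4.1), (1e18, 3.7)]
```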

These are some of the preliminary findings I had; I think the exact numbers are less important than the relationships between them. On this slide we can see that the same architecture, trained a different way, can produce different L(C)s. This result was initially surprising because I had assumed we would see the same L(C), since they have the same architecture, but it actually makes sense conceptually once you consider that the same architecture, when trained differently, processes data differently. In the case of BERT, when it is trained as a masked language model the context is encoded bidirectionally, so you have semantic information on both the left and the right when decoding masked words, which can be anywhere within a sentence; whereas in the causal implementation you only ever see the words to the left of the word you're predicting. So that was pretty surprising in terms of the results.

Another finding that was pretty interesting: Reformer formed a sort of tiered Pareto frontier, meaning that some of the larger models don't perform any better than the smallest model within the same tier, but they do use more compute, so essentially those models would be needlessly more expensive when you could just use the smaller one. That I thought was pretty interesting. Reformer did have one of the best L(C)s of the models I tested. Particularly with Reformer, I would want to see whether this pattern persists with some of the other architectures, and whether it's a pattern you can find with all of the transformers whose architecture directly impacts compute, so maybe something like the Evolved Transformer, versus just Reformer.

Some of the limitations I had: I did limited hyperparameter sweeps, and some of the problems I found could have been solved with more of them. Calculating the scaling laws with an irreducible loss term could have made the fits more precise, which could have revealed additional information. The other piece is that, by the nature of this research, some of the comparisons are simply apples to oranges; there are too many variations between two different architectures. So next time I would really want to figure out a way to isolate as many of the differences as possible, so you know what in particular is driving the changes in L(C).
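
For reference, a fit that includes an irreducible-loss term might look something like this sketch (synthetic data and scipy's curve_fit, purely illustrative; this is not the fitting code from my experiments):

```python
import numpy as np
from scipy.optimize import curve_fit

def loss_with_floor(c, l_inf, c0, alpha):
    # L(C) = L_inf + (C0 / C)**alpha: a power law plus an irreducible-loss floor
    return l_inf + (c0 / c) ** alpha

# synthetic frontier points generated from known parameters, just to exercise the fit
rng = np.random.default_rng(0)
compute = np.logspace(17, 21, 10)
loss = loss_with_floor(compute, 1.7, 3e18, 0.08) + rng.normal(0, 0.01, size=10)

params, _ = curve_fit(loss_with_floor, compute, loss, p0=[1.0, 1e18, 0.1], maxfev=20000)
print(dict(zip(["L_inf", "C0", "alpha"], params)))
```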

And last, why it matters. The architecture that scales best is the most cost-effective model to use. This is why this research matters: it allows us to find that architecture, and potentially to use our findings to understand what future architectures that continue to optimize cost could look like. As we can see, over time the gap widens in terms of the model performance you get for the same amount of compute, and if the slope is steep enough, an architecture can consistently begin to outpace other architectures within the expected range of compute. This matters because we're currently seeing ever-increasing amounts of compute being used in the industry. These scaling laws allow us to optimize for large compute regimes and choose the best architecture and model size to help reach that aim. In my experiments, the Reformer model was the architecture that produced the best power laws in comparison to the others.

The Reformer model was designed to be an efficient transformer and uses several techniques to reduce its memory footprint and compute time. I think this direct impact on compute time is likely why we see such improved performance with it. It was the only transformer architecture I experimented with that had this property, but in my next experiments I'd like to expand the architectures I consider to similar styles of transformer. So what's next for me is really just continuing this study: doing more hyperparameter tuning, expanding the list of models I'm using, and expanding the experiment suite so I can more exhaustively draw correlations and insights about the connection between model performance and compute with respect to transformer architecture.

And last but not least, I just wanted to say thank you. It's been an amazing time at OpenAI and an incredible opportunity. I really want to say a special thank you to my mentors, Aray and Nick; I'm really thankful I've been able to have the two of them. Thank you as well to everyone else at OpenAI: Mariah, Christina, Pamela, Kathy, Alethea. I also want to give special thanks to the tools that made my research possible; it would not have been possible to do this without open-source tools like Hugging Face and DeepSpeed, and without Azure. I think I might be out of time, so we'll see what Francis says, but if anyone has any questions, feel free to ask; if not, we'll see what happens with respect to time.

Yeah, you can take one question.

Okay, do we have any questions? Okay, I see a question here that asks: any intuition as to the tier structure of the Reformer? I don't think so; my only hypothesis would be that, because Reformer's performance differences are related to using tricks that specifically reduce memory and compute, the hyperparameters matter a bit more with Reformer specifically, which might not necessarily be the case with BERT, Transformer-XL, or any of the others. I couldn't isolate which one specifically was driving that change, but those are my thoughts behind it.