Words to Bytes: Exploring Language Tokenizations | Sam Gbafa | OpenAI Scholars Demo Day 2021 | OpenAI



Transcript

All right, hey everyone, my name is Sam, and I'm excited to talk to you about the project I've been working on for the last six months: Words to Bytes, exploring language tokenizations. I'm going to talk a bit about my background, then dive into the project context, then dive into the project itself, and then talk about what I've learned during my time at OpenAI.

I've worked in software engineering for the last couple of years, and I was working on my startup, Level, which is a platform for choice-based stories where people could create their own choose-your-own-adventure stories and other people could play them. During this time I encountered GPT-2, and I thought it'd be really cool to enable writers to create AI-generated stories. GPT-2 didn't have the greatest completions, but it did get me really interested in language models and other generative models. So I was interested in multimodal explorations, maybe models that could generate future video from current video or future audio from current audio, but I quickly realized that that'd be pretty challenging.

While in the Scholars program I explored lots of different things. I explored sequence models like LSTMs and Transformers, I learned the basic problems and solutions that come up in deep networks, like vanishing and exploding gradients, and I learned about meta-learning, evolution, and a bit about reinforcement learning. Throughout this time I was always interested in being able to apply some of these techniques to learning multiple modalities, particularly because it's how we learn as humans: we don't just read text, we have visual and auditory experiences that help us create our own internal models, and I was interested in how machines can do that.

Because our audience is really general, I wanted to give some context for my project. First I want to talk about sequence modeling. Sequence models are used to model data where your input or output data has a particular sequence that encodes information. A common example is something like Siri, where you say, "Hey Siri, turn on the lights." Or you can have an example like what DeepMind is doing, where you go from a DNA sequence to a folded protein.

Another thing I want to introduce is the idea of unsupervised learning, particularly with autoregressive models. An autoregressive sequence model predicts current or future values based on past values. GPT-3 is one of OpenAI's popular language models, and the model I trained is really similar.

You have a training corpus; here we can consider all of Wikipedia as an example, with a particular article about the Titanic. You break up your corpus by taking maybe the first 64 words, then the next 64 words, and so on, to come up with training examples. You give each example to your model and have it predict each word, and you give your model these examples over and over and over again. Ideally, as it's going through these training examples, it's learning the relationships there. So if you were to perform inference and give your model a prompt like "The Titanic was", your model might complete it with "the largest ship" or "a British passenger liner."
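As a rough sketch of what that windowing looks like, here's a minimal example that chunks a corpus into fixed-length next-word-prediction examples, where the target is the input shifted by one position. The 64-token window comes from the talk; the whitespace split and the names here are illustrative assumptions, not the actual training pipeline.

```python
# Minimal sketch: build next-token-prediction examples from a corpus.
# The 64-token window size is from the talk; the whitespace "tokenizer"
# and everything else here is an illustrative assumption.

def make_examples(text, window=64):
    tokens = text.split()  # stand-in for a real tokenizer
    examples = []
    for start in range(0, len(tokens) - window, window):
        chunk = tokens[start:start + window + 1]
        inputs, targets = chunk[:-1], chunk[1:]  # predict each next token
        examples.append((inputs, targets))
    return examples

corpus = "The Titanic was a British passenger liner that sank in 1912 ..."
for inputs, targets in make_examples(corpus, window=8):
    print(inputs, "->", targets)
```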

So, diving into my project. My project basically looked at sequence models, and particularly at tokenizations for those sequence models. I looked at some previous work on other language models, and there are some interesting findings that led me to focus on tokenization. The first was that finer-grained tokenizations outperformed coarser-grained tokenizations: if you have more subwords in your vocabulary, those models outperformed models with just really big words in the vocabulary. Additionally, learning the segmentations could lead to better generalization, and I'll talk a bit about what I mean by that.

For my project I took a look at different tokenizations on the same dataset. The data I used included articles from the Wall Street Journal and articles from Wikipedia. For tokenizations, I looked at word, subword, and character tokenizations, and each tokenizer was pre-trained on the training data.

I want to give some examples of what I mean by tokenization. I have an example sentence: "We went swimming to mitigate the effects of the blistering sun." Let's look at a word tokenization of this. Here the tokens are separated by whitespace, and each space is represented by this underscore, so it's pretty straightforward.

A subword tokenization breaks your sequence up into subwords. Here you can see "went" is split into two subwords, and likewise with "swimming", where the "ing" is separated from "swim"; and at the end you can see "blistering" is broken up into three subwords. This allows your model to learn the relationships between parts of words. Most English speakers know that "ing" roughly means that you're doing a particular verb, so this allows your model to also build an understanding of that kind of relationship.
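As an illustration of how a subword tokenizer splits words it doesn't have as whole units, here's a toy greedy longest-match segmenter over a made-up subword vocabulary. This is a sketch of the idea only; the project used a pre-trained subword tokenizer with a 40,000-token vocabulary, and the vocabulary and exact splits below are assumptions for illustration.

```python
# Toy greedy longest-match subword segmenter (illustration only; the real
# project used a pre-trained subword tokenizer, not this hand-made vocabulary).

TOY_VOCAB = {
    "_we", "nt", "_swim", "ming", "_to", "_mitigate", "_the",
    "_effects", "_of", "_b", "lister", "ing", "_sun",
}

def segment_word(word, vocab):
    """Split one word (prefixed with '_' for the leading space) greedily."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                               # unknown character: emit it as-is
            pieces.append(word[i])
            i += 1
    return pieces

sentence = "we went swimming to mitigate the effects of the blistering sun"
tokens = []
for word in sentence.split():
    tokens.extend(segment_word("_" + word, TOY_VOCAB))
print(tokens)
# ['_we', '_we', 'nt', '_swim', 'ming', '_to', ...]: 'went' comes out as two
# subwords and 'blistering' as three ('_b', 'lister', 'ing'), roughly matching
# the talk's example; the real splits depend on the learned vocabulary.
```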

So what happens if you tokenize into even smaller segments? You can look at character tokenizations. Here the sequence is broken up into individual characters. Multilingual models can have improved performance by tokenizing on characters, due to the nature of how different languages are broken up; if you have a logographic language like Chinese versus English, you maybe don't want to break it up into words.

I also took a look at byte tokenizations. This is the same sequence represented as bytes. It's functionally the same as a character tokenization when you look at English, and that's because UTF-8 encodes characters as one to four bytes, and English characters are usually encoded as one byte. For example, huì, meaning "meeting", is one character but translates to three bytes. So if we had a multilingual corpus, then byte tokenization would be totally worth looking at, but I decided not to.
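To make the character/byte distinction concrete, here's a small sketch contrasting character and UTF-8 byte tokenization for an ASCII sentence and for a multi-byte character. The specific glyph 会 (huì, "meeting") is my choice of illustration; the character on the slide isn't preserved in the transcript.

```python
# Character vs. UTF-8 byte "tokenization" (illustrative sketch only).

def char_tokens(text):
    return list(text)                      # one token per Unicode character

def byte_tokens(text):
    return list(text.encode("utf-8"))      # one token per UTF-8 byte (0-255)

english = "We went swimming"
print(len(char_tokens(english)), len(byte_tokens(english)))  # equal for ASCII

# A CJK character takes one character but three UTF-8 bytes.
chinese = "会"                             # huì, "meeting" (illustrative example)
print(char_tokens(chinese))                # ['会']           -> 1 token
print(byte_tokens(chinese))                # [228, 188, 154]  -> 3 tokens
```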

For my project I used a 12-layer decoder-only Transformer; it's about 80 million parameters. I looked at Penn Treebank data with word, subword (with a 40,000-token vocabulary), and character segmentations. The amount of compute was held constant, and it was the same model and context length throughout.
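As a back-of-the-envelope sanity check on the "12 layers, about 80 million parameters" figure, here's the standard rough estimate of about 12·d_model² parameters per decoder block. The hidden size of 768 and the vocabulary handling are my assumptions (the talk doesn't state them), so treat this as an order-of-magnitude sketch only.

```python
# Rough parameter count for a 12-layer decoder-only Transformer.
# d_model = 768 is an assumed hidden size (not stated in the talk); each block
# has about 4*d^2 params in attention and 8*d^2 in the MLP (4x expansion).

d_model = 768          # assumption
n_layers = 12          # from the talk
vocab_size = 10_000    # roughly the Penn Treebank word vocabulary

per_layer = 12 * d_model ** 2              # ~7.1M per block
embeddings = vocab_size * d_model          # token embedding matrix
total = n_layers * per_layer + embeddings
print(f"{total / 1e6:.0f}M parameters")    # ~93M, the same ballpark as ~80M
```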

So let's talk about some of those results. Here we're looking at the training perplexity. Perplexity is a measure of how good your model's generations are. With high perplexity, your model might generate something like "I fell off the boat and into a porcupine", versus "I fell off the boat and into the water". The first statement would have really high perplexity; the second statement makes more sense, so it has lower perplexity.
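For reference, perplexity is typically computed as the exponential of the average per-token negative log-likelihood; here's a minimal sketch of that relationship, with made-up token probabilities purely for illustration.

```python
import math

# Perplexity = exp(average negative log-likelihood per token).
# The probabilities below are made up purely to illustrate the formula.

def perplexity(token_probs):
    nll = [-math.log(p) for p in token_probs]   # per-token loss
    return math.exp(sum(nll) / len(nll))

confident = [0.6, 0.5, 0.7, 0.4]    # model assigns high probability to each token
uncertain = [0.1, 0.05, 0.2, 0.1]   # model is frequently surprised

print(perplexity(confident))   # ~1.9  (low perplexity, fluent completion)
print(perplexity(uncertain))   # 10.0  (high perplexity, "porcupine"-style output)
```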

It's really hard to see what's going on here at this scale, so let's take a look at some of these training steps zoomed in.

Zoomed in, here you can see that the word perplexity is generally the lowest, and it's decreasing the fastest with training time. When performing this experiment I expected subwords to outperform characters, but here they don't, and that's partially because our training corpus is relatively small. Using the Penn Treebank dataset, you have about 10,000 vocabulary words, so a 40,000-token subword vocabulary is a bit high, and so your character models actually perform better than your subword models. I'll talk about ways that we could prevent that.

Additionally, our validation perplexity is really high. This was one run among many; several runs showed this relationship between words, subwords, and characters. However, it does show that our model initially overfit and wasn't well regularized, so we weren't generalizing well in this particular run.

So, some of our findings were that smaller segmentations can have more nuanced representations, but you need a larger model to capture these relationships well. That's partially because the Transformer builds up its representation in the earlier layers; with characters, you maybe have to build a representation of a word before you are able to predict the next word.

Another thing worth considering in the project is varying the context length. In our example from earlier, the sentence was 11 words but 59 characters, so in order to represent the same amount of data, your context length needs to be longer for these smaller segmentations.
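To illustrate the point, here's a quick sketch comparing the sequence length of the example sentence at word, character, and byte granularity. The exact counts depend on how spaces and punctuation are handled, which is why they can differ slightly from the numbers quoted in the talk.

```python
# Sequence length of the same sentence under different granularities.
# Exact counts depend on punctuation/space handling, so treat these as
# illustrative rather than as the exact figures from the talk.

sentence = "We went swimming to mitigate the effects of the blistering sun"

print("words:", len(sentence.split()))           # 11
print("chars:", len(sentence))                   # 62 (spaces included)
print("bytes:", len(sentence.encode("utf-8")))   # same as chars for ASCII text
```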

The number of subwords is a really important hyperparameter when doing these comparisons, so it's worth including multiple subword tokenizers as you do a sweep. And larger, more diverse datasets should be explored, particularly if you're going to explore byte-level tokenizations.

And so I want to talk a bit about what I learned throughout the entire Scholars program. I come from an engineering background, not really research, so it was really great for me to learn how to identify and get the most out of just reading papers. That was probably one of the biggest takeaways for me: being able to take in a paper and identify what's useful there. I learned about building various models and understanding what different architectures are doing. I've always been interested in software architecture, and the architecture of models is a really interesting place to explore, so I just learned a lot there.

I learned about getting your data right, and how small issues in your data can really blow up in a deep network; you can spend a lot of time trying to figure out what's going wrong, and it could just be in your data. I was in a place where my model was terrible and it couldn't learn, so I learned all about overfitting and hyperparameter optimization. There are a lot of little subtle details, and it's a really iterative process where maybe you're tweaking a learning rate scheduler, tuning your optimizer, just really tweaking a lot of stuff, but when you really get it right you see these exponential improvements, and it's really awesome. I also learned about regularization: your model is really learning the training data, but trying to get your model to generalize and learn something real is its own particular challenge.

The last thing I want to talk about is that I learned and thought a lot about the implications of these generative models. Before I joined, I was really excited about generative models, they were just super cool, and I didn't really think about the implications of releasing them. But being at OpenAI, talking to people, and reading a lot about this gave me the perspective of really thinking about what the implications are of the models we create and their impact on society and democracy. So that was really great.

Yeah, that was a lot, but I want to thank you all for listening. I want to thank my mentor, Arvind, who was super helpful through all of this, was very patient, and gave such great insights. And I want to thank my fellow Scholars for being here with me. So with that, let's dive into Q&A.

Okay: what was the size of the vocabulary for the different tokenization schemes? So, the tokenizers were pre-trained, and they learned the vocabulary as they went through the pre-training. For the word tokenization, Penn Treebank had about ten thousand vocabulary words. The subword tokenization had 40,000 vocabulary words. And the character tokenization actually had a much smaller vocabulary, because it only learned the unique characters; I think the character vocabulary was a little over 50 characters, I don't know the exact number.
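As a sketch of why the character vocabulary ends up so small: a character-level tokenizer essentially just collects the unique characters seen in the training text (a real setup would also add special tokens). The snippet below illustrates this, with a short string standing in for the Penn Treebank training data.

```python
# Building a character-level vocabulary: the unique characters in the training
# text (a real setup would add special tokens such as <unk> or <eos>).
# The short string here stands in for the actual Penn Treebank training data.

train_text = "We went swimming to mitigate the effects of the blistering sun"

vocab = sorted(set(train_text))
char_to_id = {ch: i for i, ch in enumerate(vocab)}

print(len(vocab))     # 19 for this toy string; a few dozen for real English text
print(vocab[:10])     # [' ', 'W', 'a', 'b', 'c', 'e', 'f', 'g', 'h', 'i']
```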

How does exploring tokenizations relate to multimodal models? Well, to give some background: I set out to do a scaling-law suite, looking at these different tokenizations and how they scale. This was so that I could then maybe learn the segmentations, and then see how learning the segmentation improves the model. By learning the segmentations in text, I thought there might be some insights into learning the segmentations in other modalities. There's previous research suggesting that if you learn the segmentations you can really improve your performance; for example, multilingual translation seems to be improved by learning the segmentation, so going from English or Spanish to Japanese can outperform English-to-Japanese or Spanish-to-Japanese alone if you learn the segmentation. So I was really interested in learning the segmentation, but I first wanted to get a baseline of what the current tokenizations do. That's kind of the framing of my project.

I see I think I'm at time, but if anyone wants to reach out to me, feel free to reach out over email. And with that, I would like to introduce Shola.