Wolfram - What is ChatGPT doing...and why does it work?


Okay, hello everyone. Usually in this time slot each week I do a science and technology Q&A for kids and others, which I've been doing for about three years now, where I try to answer arbitrary questions about science and technology. Today I thought I would do something slightly different: I just wrote a piece about ChatGPT, what it's actually doing and why it works. I thought I would talk a bit about that here and then throw this open for questions, and I'm happy to try to talk about all things ChatGPT, AI, large language models, and so on that I might know about. All right: bursting onto the scene, what, a couple of months ago now, was our friend ChatGPT.

I have to say it was a surprise to me that it worked so well. I'd been following the technology of neural nets for, I've worked out now, 43 years or so, and there have been moments of significant improvement and long periods where it was an interesting idea but it wasn't clear where it was going to go. The fact that ChatGPT can work as well as it does, can produce reasonable human-like essays, is quite remarkable, quite unexpected; I think even unexpected to its creators. What I want to talk about is, first of all, how does ChatGPT basically work, and second of all, why does it work? Why is it even possible to do what has always seemed to be a pinnacle of human intellectual achievement: you know, write that essay describing something? I think what ChatGPT is showing us is some things about science and about language and about thinking, things that we might have suspected long ago but haven't really known, and it's really showing us a piece of scientific evidence for this. Okay, so what is ChatGPT really doing? The starting point is that it is trying to take an initial piece of text that you might give it and continue that piece of text in a reasonable, human-like way that is characteristic of typical human writing.

So you give it a prompt, you say something, you ask something, and it's kind of thinking to itself: I've read the whole web, I've read millions of books; how would those typically continue from this prompt I've been given? What's the reasonable expected continuation, based on some kind of average of, you know, a few billion pages from the web, a few million books, and so on? That's what it's always trying to do: continue from the initial prompt it's given, in a statistically sensible way. So let me start sharing here.

Let's say you had initially given it "The best thing about AI is its ability to". Then ChatGPT has to ask: what's it going to say next? Now, one thing I should explain about ChatGPT, which is kind of shocking when you first hear about it, is that those essays it's writing, it's writing them one word at a time. As it writes each word, it doesn't have a global plan about what's going to happen; it's simply saying: what's the best word to put down next, based on what I've already written? And it's remarkable that in the end one can get an essay that feels like it's coherent and has a structure and so on, but really, in a sense, it's being written one word at a time. So let's say the prompt has been "The best thing about AI is its ability to". What's ChatGPT going to do next? Well, what it's going to do is say: what should the next word be, based on everything I've seen on the web, etc.? What's the most likely next word? What it figures out are probabilities: it says "learn" has probability 4.5 percent, "predict" 3.5 percent, and so on. And then it puts down the next word it thinks it should put down. So one strategy it could adopt is: always put down the word that has the highest probability, based on what it's seen from the web and so on.

It turns out that particular strategy, of just putting down the thing with the highest probability, doesn't work very well. Nobody really knows why; one can have some guesses. But if you do that, you end up getting very flat, often repetitive, sometimes even word-for-word repetitive essays. So it turns out, and this is typical of what one sees in a large engineering system like this, that a certain touch of voodoo is needed to make things work well. One piece of that is saying: don't always take the highest-probability word; with some probability, take a word of lower than highest probability. There's a whole mechanism for this, usually called a "temperature" parameter, by analogy with statistical physics: you're jiggling things up to a certain extent, and the higher the temperature, the more you're jiggling things up rather than just doing the most obvious thing of taking the highest-probability word. It turns out that a temperature parameter of 0.8 apparently seems to work best for producing things like essays.
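Since the code in the piece is Wolfram Language, here is an equivalent Python sketch of the temperature idea, with a made-up toy word table for illustration. Rescaling each probability by the exponent 1/T and renormalizing is mathematically the same as dividing the logits by T before the softmax: low temperature concentrates on the top word, higher temperature jiggles things up.

```python
import random

def sample_next_word(probs, temperature=0.8):
    """Pick a next word from a {word: probability} table.

    temperature == 0 means always take the most probable word (greedy);
    higher temperatures give lower-ranked words more of a chance.
    """
    if temperature == 0:
        return max(probs, key=probs.get)
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights, k=1)[0]

# Toy distribution like the one shown above for
# "The best thing about AI is its ability to ..."
probs = {"learn": 0.045, "predict": 0.035, "make": 0.032, "understand": 0.031}
print(sample_next_word(probs, temperature=0))    # greedy: always "learn"
print(sample_next_word(probs, temperature=0.8))  # random, favoring likely words
```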

Okay, so one of the things that's nice to do is to get some concrete view of what's going on, and we can actually start looking, on our computer, at what it's doing.

What I'll talk about here is based on this piece that I wrote, which just came out a couple of days ago. (Oops, not seeing the screen... oh, there we go; let me show you again what I was showing before.) This is the piece that I wrote, and I want to emphasize that every picture in it has click-to-copy code: if I click a picture, I get a piece of Wolfram Language code that will generate it. Just click it, paste it into a Wolfram Language notebook, on a desktop computer or in the cloud, and you can just run it.

Okay, so let's actually run an approximation, at least, to ChatGPT. OpenAI produced a series of models over the last several years; ChatGPT is based on the GPT-3.5 model, I think. These models got progressively bigger, and progressively more impossible to run directly on one's local computer. This is a small version of the GPT-2 model, which is something you can just run on your computer; it's part of our Wolfram Neural Net Repository, and you can just pick it up from there. This is the neural net that's inside a simplified version of ChatGPT, and we'll talk later about what all of these innards really are. But for now we can just say: let's use that model, and let's have it tell us the words with the top five probabilities, based on the starting prompt "The best thing about AI is its ability to". So those are the top five words.

Let me ask it for 20 words here. These are probably sorted; we probably want to sort them in reverse order, and this will now show us these words with their different probabilities. (Actually, I'm confused by why this didn't... oh, I know, I didn't do that; let me just make this do what I expect. Okay, here we go.) So this is that sequence of words; by the 20th word we're getting down to "keep". Just for fun, let's go find out what the 50th word was. Okay, down here we're seeing words that were thought to be less likely. What does it mean to be less likely? It means that, based essentially on ChatGPT's extrapolation from what it has seen in billions of documents on the web, these are the words that have certain probabilities of occurring next in that particular sentence. Okay, so now let's say we want to go on.

The prompt is "The best thing about AI is its ability to", and the next word it might pick is "learn". But then what's the word it's going to pick after that? We can figure that out by saying the next word was "learn", filling the "learn" in, and asking for the top five probabilities for the word after it. Okay: the most probable next word is "from". So we could say "learn from", and then the next most probable word is "experience".

All right, so let's write a piece of code that automates that: we're going to nestedly apply this function that just takes the most likely word, so to speak, and let's do that 10 times. This is what we get, using the GPT-2 model and asking for the most likely continuation of that piece of text. There it goes. Now, this is the case where it's always picking the most probable word, and as I said before, in this zero-temperature case it very quickly ends up getting itself tangled in some loop. Let's see if I have the example of what it actually does in that case... yeah, here we go. This is not a particularly impressive essay, and it gets itself quite tangled up. If you don't always pick the most probable word, things work much better.
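The "nestedly apply the most-likely-word function" loop can be sketched in Python. The lookup table here is a made-up stand-in for the GPT-2 model call (the real code queries the model at each step), chosen so you can see how zero-temperature, greedy generation falls into a word-for-word repetitive loop:

```python
# Toy stand-in for "what's the most probable next word": maps a word to its
# single most likely successor. (Hypothetical table for illustration; the
# real version calls the GPT-2 model at each step.)
most_likely_next = {
    "to": "learn", "learn": "from", "from": "experience",
    "experience": "and", "and": "to",   # note the cycle
}

def greedy_continue(prompt_words, steps=10):
    """Zero-temperature generation: nestedly append the most probable word."""
    words = list(prompt_words)
    for _ in range(steps):
        words.append(most_likely_next[words[-1]])
    return " ".join(words)

print(greedy_continue(["the", "best", "thing", "about", "AI",
                       "is", "its", "ability", "to"]))
# the continuation cycles: "...learn from experience and to learn from..."
```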

So, for example, here are some examples of what happens when you use this temperature to jiggle things up a bit and not always pick the word that's estimated as most probable. I showed you a few examples of less probable words; it's worth realizing that there's a huge spectrum of different words that can occur, with progressively lower probabilities. It's a typical observation about language, which you see here as well, that the nth most common word has probability of about 1/n. You see that for the word that will follow next, and you also see it in general for words in text.

Okay, so we could ask what happens in the zero-temperature case for the actual GPT-3 model; this is what it does. Now, one feature of this: this is a link to the OpenAI API that's in our Paclet Repository, and if you use that link and simply call GPT-3 at zero temperature, then because it's always picking the most probable word, the result will be the same every time; there's no randomness to it. What usually happens, when you have non-zero temperature and you're picking words that aren't always the most probable word, is that a certain randomness is being added, and that randomness will cause you to get a different essay every time. That's why, if you press the regenerate button, you will most likely get a different essay each time you regenerate: it's going to pick different random numbers to decide which of the ranked words it's going to use. So this is a typical example of a temperature-0.8 essay generated by GPT-3. Okay, so the next big question is: we've got these probabilities for words and so on; where do those probabilities come from?

What I was saying is that the probabilities are basically a reflection of what's out there on the web; those are the things ChatGPT has learned from, and it's trying to imitate the statistics of what it's seen. All right, so let's take some simpler examples of that. ChatGPT essentially deals with putting down words one at a time (actually they're pieces of words, but for the simpler cases we can assume they're just words). But to start understanding this, let's think about putting down individual letters one at a time. So the first question is: if we're going to put down letters one at a time, with what probability should we put down each letter? How do we work that out? Okay, let's pick some random text.

Let's pick the Wikipedia article about cats, and let's just count the letters in it. We see that "e" is the winner, "a" is the runner-up, "t" comes next. So based on that sample of English text, this is what we would conclude about the statistics of different letters. Let's try the Wikipedia article about dogs: we get slightly different results; "o" shows up with higher probability, probably because there's an "o" in the word "dog", and so on. But let's keep going, and use a very large sample of English: say we have a few million books, use that as our sample of English, and ask what the probabilities are for different letters in that very large sample. We'll see what many people will immediately know: that "e" is the most common letter, followed by "t", "a", etc.

So these are our probabilities. Now let's say we want to start generating text according to those probabilities: you just fill those frequencies in and have it start generating letters. This is just generating letters according to the probabilities we get from the occurrence of those letters in English. That was asking it to generate 500 letters with the correct probabilities to correspond to English text. It's really bad English text, but the fraction of e's should be about 12 percent, the fraction of t's about 9 percent, and so on.
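That count-then-sample procedure is a few lines in any language. The original uses Wolfram Language; here is a Python sketch with a toy sample string standing in for the Wikipedia articles:

```python
import random
from collections import Counter

def letter_probs(sample_text):
    """Single-letter probabilities estimated from a sample of text."""
    letters = [c for c in sample_text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def generate(probs, n=500):
    """Generate n letters at random with the given single-letter probabilities."""
    letters, weights = zip(*probs.items())
    return "".join(random.choices(letters, weights=weights, k=n))

# Toy sample standing in for the Wikipedia article about cats
sample = "the cat is a domestic species of small carnivorous mammal " * 50
probs = letter_probs(sample)
print(generate(probs, 60))  # statistically letter-correct, but not English
```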

Okay, we can make it a little bit more like English text by also appending a certain probability of having a space. Let's make a bigger version of this: now it's generating quote-unquote "English text" with the correct probabilities for letters and spaces. We can make it a little more realistic still. Here we were just chopping it into words by saying there's an 18 percent chance that a character is a space; instead, let's insist that the words have the correct distribution of lengths. This is the text we now get, where the words have the correct distribution of lengths and the letters occur with the correct probabilities, "e" being the most common, and so on. Clearly not English; clearly a lose. If ChatGPT were generating this, it would be a fail.

But this is something which, at the level of individual letters, is statistically correct: if we asked, "can you tell that this isn't English just by looking at the frequencies of different letters?", it would say this is English. And different languages have different characteristic signatures of letter frequencies. If we did this for English and then did the corresponding thing for, let's say, Spanish, we'd get slightly different frequencies: somewhat similar, but not quite the same.

So that's what happens generating English text with the correct single-letter statistics. We can just plot the probabilities for those individual letters (oh boy, that came out more complicated than it needed to be): "e" is the most common, "q" is very rare, etc. In this case we're assuming that every letter is picked at random, independently. However, in actual English we know that's not the case: for example, if a "q" has been picked, then with overwhelming probability the next letter will be a "u", and similarly for other combinations of letters, other 2-grams, other pairs of letters.

So instead of asking for the probability of an individual letter, we can ask for the probability of a pair of letters occurring together. Here we go: this is saying that, given that the letter "b" occurred, the probability for the next letter to be "e" is fairly high, while the probability for the next letter to be "f" is very low. Over here, when there's a "q", the probability is only substantial when the next letter is a "u". So that's what the probabilities for combinations of pairs of letters look like.

So now let's try to generate text a letter at a time, using not just the individual probabilities of letters but also the probabilities of pairs of letters. When we do that, it starts looking a little more like real English text: there are a couple of actual words here, like "on" and "the", and well, "Tesla", I guess, is a word of sorts. This is getting a bit closer to actual English text, because it's capturing more of the statistics of English.
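A letter-pair (bigram) generator like this is also a short piece of code. A Python sketch, again with a toy sample string standing in for a large English corpus: it counts which letter follows which, then samples each next letter conditioned on the previous one.

```python
import random
from collections import Counter, defaultdict

def bigram_model(sample_text):
    """For each letter, count which letters follow it in the sample."""
    text = "".join(c for c in sample_text.lower() if c.isalpha())
    follows = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        follows[a][b] += 1
    return follows

def generate(follows, start, n=60):
    """Generate letters one at a time, each conditioned on the previous letter."""
    out = [start]
    for _ in range(n - 1):
        nxt = follows[out[-1]]
        if not nxt:  # dead end: restart from a random known letter
            out.append(random.choice(list(follows)))
            continue
        letters, weights = zip(*nxt.items())
        out.append(random.choices(letters, weights=weights)[0])
    return "".join(out)

# Toy sample; the article uses a large sample of real English
sample = "the cat sat on the mat and the dog ran to the cat " * 100
print(generate(bigram_model(sample), "t"))
```

Extending the same idea to triples, 4-tuples, and longer blocks of letters just means conditioning on more preceding characters.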

We can go on: instead of just having the correct probabilities for individual letters and pairs of letters, we can have the correct probabilities for triples of letters, combinations of four letters, and so on. (Actually, these numbers are probably off by one: those are single letters, these are pairs of letters, and so on.) This is 6-tuples of letters, and we can see that by the time you're following the probabilities for 6-tuples of letters, you're getting complete English words, like "average" and so on. The fact that it finishes the word that way is why autocomplete, when you type on a phone or something like that, can work as well as it does: by the time you've typed "aver", there's only a limited number of words that can follow, so you've pretty much determined it. That's how the probabilities work when you're dealing with blocks of letters rather than small numbers of letters.

Okay, so that's the idea: you're capturing the statistics of letters and sequences of letters, and you're using that to randomly generate text-like things. We can also do that not with probabilities of individual letters but with probabilities of words. In English there are maybe 40 or 50 thousand fairly commonly used words, and we could simply ask, based on some large sample from millions of books or something, what the probabilities of those different words are. (The probabilities of different words have changed over time and so on, but let's say we take the course of all books, or the current time.) So take the probabilities for all those, say, 50,000 different words, and now just start generating sentences where we pick words at random, but with the probabilities that correspond to the frequencies with which they occur in these samples of English text. So there's a sentence we get by that method, and it's a sentence where, well, these words occur with the right probabilities, but the sentence doesn't really mean anything; it's just a collection of random words.

Now we can do the same thing we did with letters: instead of just using a certain probability for each individual word, we correctly work out the probabilities for pairs of words, based on our sample of English text, and so on. (That's actually already a comparatively difficult thing to do computationally, even for pairs of words, because we're dealing with 50,000 squared different possibilities.) Now let's say we start with a particular word, say "cat"; that's our prompt here. Then these are sentences generated with the correct probabilities for pairs of words. We'll see things like "the book"; "throughout in" is a little bit bizarre, but "confirmation procedure", I guess, is a pair of words that occurs together a bunch, at least in the place where all this text was sampled from.

So this is what you get when you're sampling text a pair of words at a time. This is a very pre-ChatGPT, super-minimalist version, dealing just with statistics of pairs of words, as opposed to the much more elaborate stuff ChatGPT is really doing. Now you could say: to do something more like what ChatGPT does, instead of picking pairs of words, let's pick combinations of five words, or 20 words, or 200 words. Given the prompt we've specified, let's ask it to add the next 200 words with the probabilities you would expect based on what's out there on the web. Maybe we just make a table of the chance of having each three-word combination, four-word combination, five-word combination. Okay, here's the problem with that: there just isn't enough English text, or text in any language, that's ever been written, to be able to estimate those probabilities in this direct way.

In other words: I said there may be 40,000 common English words. That means the number of pairs of words whose probabilities you have to ask about is 1.6 billion, and the number of triples is about 60 trillion. You pretty quickly end up with something you couldn't possibly do: there just isn't enough text in the few billion web pages that exist, and so on, to sample all of those 60 trillion triples of words and say what the probability of each one is. By the time you get to, like, a 20-word essay, you're dealing with a number of possibilities that's more than the number of particles in the universe. You wouldn't even be able to record those probabilities, even if you had text written by some infinite collection of monkeys imitating humans.
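The combinatorics here are easy to check directly:

```python
# With a vocabulary of 40,000 common English words, the number of n-grams
# you'd need probabilities for explodes combinatorially:
vocab = 40_000
pairs = vocab ** 2      # pairs of words
triples = vocab ** 3    # triples ("about 60 trillion" in round numbers)
essay_20 = vocab ** 20  # possible 20-word sequences

print(f"{pairs:,}")       # 1,600,000,000
print(f"{triples:,}")     # 64,000,000,000,000
print(f"{essay_20:.2e}")  # ~1e92, vs ~1e80 particles in the observable universe
```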

So how do we deal with this? How does ChatGPT deal with the fact that it can't sample enough text from the web to just make a table of all those probabilities? Well, the key idea, which is a very old idea in the history of science, is to make a model. What is a model? A model is something where you're summarizing data, summarizing things in a way where you don't have to have every piece of data: you can have a model that allows you to predict more data, even if you didn't immediately have it. The quintessential, very early example of modeling was Galileo.

In the late 1500s he was trying to figure out things about objects falling under gravity: going up the Tower of Pisa, dropping cannonballs off different levels, and asking how long it takes for these things to hit the ground. So he could make a plot (gosh, that's a remarkably complicated way to make this plot). I don't know how many floors there actually are in the Tower of Pisa, but imagine there were this number of floors. You make a plot, measuring (in those days by taking his pulse or something) how long it took the cannonball to hit the ground, as a function of the floor it was dropped from. So there's data about specific times for specific floors. But what if you want to know how long it would take the cannonball to hit the ground from the 35th floor, which didn't happen to have been explicitly measured?

This is where the idea of "let's make a model" comes in. A typical thing you might do is to just assume it's a straight line: assume that the time to hit the ground is a linear function of the floor. This is the best straight line we can fit through that data, and it allows us to predict the time to hit the ground from a floor we didn't explicitly visit. Essentially, this model is a way of summarizing the data, and summarizing what we expect when we continue from this data. The reason this is going to be relevant to us is that, as I mentioned, there isn't enough data to know these probabilities for different words just from actual text that exists, so you have to make a model: you're saying, assume this is how things generally work; this is how we figure out the answer when we haven't explicitly made a measurement. And we can make different models, and we'll get different results.

For example, here's another model we might pick: a quadratic curve through these particular data points. Now, it's worth realizing that there's no model-less model; you're always making certain assumptions about how things work. In the case of these problems in physics, like dropping balls from towers and so on, we have a pretty good expectation that simple mathematical models, mathematical formulas and so on, are likely to work. But it doesn't always happen that way. Here's another mathematical function: this is the best version of that model (it has some parameters) for fitting this data, and you can see it's a completely crummy fit. If we assumed that this is, in general, the way things work, we wouldn't be able to correctly reproduce what this data is saying. In the case of this model, I think it has three parameters that are trying to fit this data, and it doesn't do very well.
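Fitting a model like the straight line described above is a few lines of code. A Python sketch; the (floor, time) data below is hypothetical, generated from the actual square-root law t = sqrt(2h/g) with 3-meter floors, which is exactly why a straight-line model can only approximate it:

```python
# Least-squares fit of a straight-line model, like fitting fall times vs. floor.
# Hypothetical data following t = sqrt(2h/g); a line is only an approximation.
data = [(2, 1.11), (5, 1.75), (10, 2.47), (20, 3.50), (30, 4.28)]

def fit_line(points):
    """Return (slope, intercept) of the best-fit line through (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept

slope, intercept = fit_line(data)
# Predict the unmeasured 35th floor from the model
print(f"predicted fall time from floor 35: {slope * 35 + intercept:.2f} s")
```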

What ChatGPT is doing, it basically has 175 billion parameters that it's trying to fit to make a model of human language, and the hope is that when it has to estimate the probability of something in human language, with its 175 billion parameters, the underlying structure it's using is such that it will estimate those probabilities more correctly than this example does. All right, so here's the next big thing to talk about.

Dropping balls from towers of Pisa and so on is something where we've learned, over the 300 years since Galileo, that there are simple mathematical formulas that govern those kinds of physical processes in nature. But when it comes to a task like "what's the most probable next word", or some other kind of human-like task, we don't have a simple mathematics-style model. For example, here's a typical human-like task: we're given an image, an array of pixels, and asked to recognize which digit, out of the 10 possibilities, it is. We humans do a pretty good job of saying "that's a 4", "that's a 2", and so on. But we need to ask: how do we think about this problem?

One thing we could say is: let's try doing what we were doing before, where we just collect the data and figure out the answer based on the collected data. So we might get ourselves a whole collection of 4s and ask: when we're presented with a particular array of pixel values, does that array match one of the 4s in our sample? The chance of that happening is incredibly small, and it's clear that we humans do something better than that. It doesn't matter exactly where the individual pixels fell; so long as it's roughly in the shape of a 4, we're going to recognize it as a 4.

So the question is: how does that work? Well, let's say we use a simple neural net (this is actually a standard machine-learning problem) to recognize these handwritten digits. We see it gets the right answer there. But if we ask what it's really doing, let's give it a set of progressively more blurred digits. At the beginning it gets them right; then it quote-unquote "gets them wrong". What does it even mean that it gets them wrong? We know this was a 2 that we put in, and we know we just kept on blurring that 2, so we can say it got it wrong because we knew it was supposed to be a 2.

But if we zoom out and ask what's happening at a broader level, the question is: if we humans were looking at those images, would we conclude that's a 2 or not? By the time it gets blurred enough, we humans wouldn't even know it's a 2. So to assess whether the machine is doing the right thing, what we're really asking is: does it do something more or less like what we humans do? For these kinds of human-like tasks there's no obvious right answer; the question is just whether it does something that follows what we humans do. And that question of what's the right answer: for humans, we might say most humans recognize that as a 2. If instead we had visual systems like bees or octopuses or something like that, we might come to completely different conclusions once things get blurred out; what we consider to be "2-like" might be quite different. It's a very human answer to say that that still looks like a 2; it depends on our visual system. It's not something where there's a mathematically precise definition that it has to be a 2.

how do these models which we’re using for things like image recognition how do they actually work

the most popular by far and most successful approach at the present time is to use neural nets

so okay what is a neural net it's kind of an idealization of what

we think is going on in the brain what’s going on in the brain well we all have about 100 billion

neurons in our brains which are nerve cells that have the feature that when they get excited they

produce electrical signals maybe a thousand times a second and each nerve cell takes that

electrical signal along sort of wire-like projections that connect to maybe a thousand maybe ten

thousand other nerve cells so what happens in a rough approximation is that you'll have

electrical activity in one nerve cell and that will communicate itself to other nerve cells and

there's this whole network of nerves that has this elaborate pattern of electrical activity and

roughly the way it seems to work is that the extent to which

one nerve cell will affect others is determined by the weights associated with

these different connections and so one connection might have a very strong positive effect on another

nerve cell if the first nerve cell fires it makes it very likely the next nerve cell will fire

or the connection might be an inhibitory one where if one nerve cell fires it makes it very

unlikely for the next nerve cell to fire and there's some whole combination

of these weights associated with these different connections between nerve cells
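as a sketch of that weighted excitatory-versus-inhibitory behavior in a few lines of code (the weights and inputs below are made-up illustrative numbers, not anything measured from a real brain):

```python
def fires(inputs, weights, threshold=0.0):
    # idealized nerve cell: sum each incoming signal times its connection
    # weight (positive = excitatory, negative = inhibitory) and fire only
    # if the combined signal exceeds the threshold
    total = sum(w * x for w, x in zip(weights, inputs))
    return total > threshold

# the excitatory input alone makes the cell fire...
print(fires([1, 0], [0.9, -2.0]))  # True
# ...but when the inhibitory input is also active it suppresses firing
print(fires([1, 1], [0.9, -2.0]))  # False
```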

so what actually happens when we're trying to recognize a two in an image for example well the

light the photons from the image fall on the cells at the back of our eye at our retina these

are photoreceptor cells they convert that light into electrical signals the electrical signals

end up going through nerves that get to the visual cortex at the back of our head and there's

an array of nerves that correspond

to all the different essentially pixel positions in the image and then what’s happening is that

within our brains there’s this sequence of connections there’s sort of layers of neurons

that process the electrical signals that are coming in and eventually we get to the point

where we kind of form a thought that that image that we’re seeing in front of us is a two and

then we might say it’s a two um but that process of sort of forming the thought that’s what we’re

talking about is kind of this process of recognition i was talking about it in the in the

actual neural nets that we have in brains but what is being done in all of these models including

things like chat gpt is an idealization of that neural net okay so for example in um uh in the

particular neural net we were using for image recognition this is kind of a wolfram language

representation of that neural net and we're going to talk about all these pieces in here not in

total detail it's very kind of engineering-slash-biological there are a lot of different funky

little pieces that go together to actually produce the result of recognizing digits and so on

this particular neural net was constructed in 1998 and it was really done as a piece of

engineering so how do we think about the way this neural net works essentially the key idea

is the idea of attractors that’s an idea that actually emerged from mathematical physics and

so on but it's a key idea when we're thinking about neural nets and suchlike so what is that

idea let's say we've got all these different handwritten digits the ones the twos etc what we

want is that if we lay all these digits out in some way and we're near the ones we're attracted

to the one spot and if the thing we have is near the twos we're attracted to the two spot

the idea of attractors is imagine you have some mountainscape or something like this and you're

a drop of water that falls somewhere on the mountain you're going to roll down the mountain

until you get to the minimum for your particular part of the mountain but then there'll be a

watershed and if you're a raindrop that falls somewhere else you'll roll down to a different

minimum a different lake and it's the same kind of thing here when you move far enough away

from the thing that looks like a one you'll roll down into the twos attractor rather than the

ones attractor that's kind of the idea there now let's make a kind of

idealized version of this let's say we've got a bunch of points on the plane and those are the

coffee shops and you say i'm always going to go to the closest coffee shop to me well this

so-called voronoi diagram shows you the division the watersheds between coffee shops if you're

on this side of a watershed you'll go to this coffee shop if you're on the other side you'll go

to that coffee shop so that's kind of a minimal version of this idea of attractors all right so

let's talk about neural nets and their relationship to attractors

let's take an even simpler version with just three attractors the zero attractor the plus one

attractor and the minus one attractor we're still going to say we have x and y coordinates and

if we fall in this region here

we're going to want the result to be zero we're in the basin of the zero attractor and we want

to produce a zero okay so we can say as a function of the position x and y that we start from

what output do we want to get well on this side we want to get a one on this one a minus one

and there a zero this is the thing we're trying to set up something that will have this kind of

behavior okay well now let's pull in a neural net so this is a typical tiny

neural net each of these dots represents an artificial neuron each of these lines represents a

connection between neurons and the blue-to-redness represents the weight associated with that

connection with blue being the most negative and red the most positive and this is showing a

neural net with particular choices for these weights by which one neuron affects others okay so

how do we use this neural net well we feed in

inputs at the top we say those top two neurons got values 0.5 and minus 0.8 for example interpreting

that in terms of the thing we’re trying to work with that’s saying we’re at position x equals 0.5

y equals minus 0.8 in that diagram that we had drawn so now this neural net is basically just

computing a certain function of these values x and y at each step for this neuron here what

it's doing is saying i want this weight multiplied by this value here and this weight multiplied

by this value here and then i'm going to add those two numbers up then we add a constant offset

a different offset for each neuron and we get some number out and then the weird thing one does

which is sort of inspired by what seems to happen biologically is we apply some kind of

thresholding function a very common one to use is relu if that total number is less than zero

make it be not its actual value but just zero and if it's greater than zero make it be its

actual value and there are a variety of different so-called activation functions activation

because they're what determine what the activity of the next neuron down the line will be based

on the input to that neuron
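that per-neuron computation (weights times incoming values, plus a constant offset, then the relu thresholding) is only a few lines; the weights below are made up for illustration, not the ones in the net being shown:

```python
def relu(x):
    # the thresholding: negative totals become zero, positive ones pass through
    return max(0.0, x)

def neuron(values, weights, bias):
    # weighted sum of the incoming values, plus the offset,
    # passed through the activation function
    return relu(sum(w * v for w, v in zip(weights, values)) + bias)

def layer(values, layer_weights, biases):
    # one layer: every neuron does the same computation with its own weights
    return [neuron(values, w, b) for w, b in zip(layer_weights, biases)]

# feed x = 0.5, y = -0.8 into one small made-up layer of three neurons
print(layer([0.5, -0.8], [[1.2, -0.5], [-2.0, 1.0], [0.3, 0.3]], [0.1, 0.0, -0.2]))
```

with these particular made-up weights the second and third neurons end up with negative totals, so relu clamps them to zero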

so here again at every step we're just collecting the values from the neurons at the previous

layer multiplying by weights adding the offset and applying the activation function relu to get

this value minus 3.8 in this case and what's happening here is we start off with these values

0.5 and minus 0.8 we go through this whole neural net and in this particular case at the end it

comes out with the value minus one okay so what does this neural net the one we've just been

showing do as we change those inputs well we can plot it

that's what that neural net actually does remember what our goal is every time we have a value

in this region we want to give a zero in that region a minus one and so on and this is what

that particular neural net succeeds in doing it didn't quite make it to giving the exact zero

one minus one values but it's kind of close so this is a neural net that's been set up to be as

close as it can be for one of that size and shape to giving us the exact function we wanted to

compute well how do we think about what this neural net

is doing the neural net is just computing some mathematical function for the particular neural

net i was showing if the w's are the weights the b's are the offsets and f is the activation

function this is the messy algebraic formula that says what the value of the output is going to

be as a function of x and y the values of the inputs so now the question is

well as we look at simpler neural nets what kinds of functions can we actually compute at the

minimum level this is a single neuron here getting input from two other neurons what function

is it computing well it depends on the weights these are the functions that get computed for

different choices of weights very simple functions in all cases just these ramps so now we can

ask okay let's use a slightly more

sophisticated neural net here's a still very small neural net and this is the best it can do in

reproducing the function we want a slightly bigger neural net does slightly better and an even

bigger one pretty much nailed it it didn't quite nail it right at the boundary it's a bit

confused instead of going straight from red to blue it's got this area where it's giving yellow

but to a first approximation this little neural net was a pretty good representation of the

mathematical function that we wanted to compute and this is the same story as what we're doing

in the recognition of digits where again we've got a neural net it happens to have i think

about 40 000 parameters in this particular case that are doing the same kind of thing working

out the function that goes from the array of pixels at the beginning to the values zero through

nine and so on well again we can ask the question is it getting the

right answer well again that's a hard question it's really a human-level question because

whether it put a one in the wrong place so to speak is a question of how we would define that

well we can do similar kinds of things with other kinds of images we might

try and make a neural net that distinguishes cats from dogs and here we’re showing sort of how it

distinguishes those things and mostly the cats are over in this corner the dogs are over in this

corner but the question of what it should really ultimately do what should it do if we put a

dog in a cat suit should it say that's a cat or a dog it's going to say some definite thing the

question is whether it agrees with what we humans would assess it to be well one question you

might ask is what's this neural net doing inside when it works out its sort of catness or

dogness let's say we start with an image of a cat here now we can ask what's going on inside

the neural net when it decides that this is actually an image of a cat well normally

when we are looking at the insides of a neural net it’s really hard to tell what’s happening

in the case where the neural net works on an image the neural nets tend to be set up so that

they preserve the pixel structure of the image so for example here we can go just one layer

down in the neural net and what happens in this particular neural net is it takes that image of

a cat

and it breaks it up into a lot of different variants of that image now at this level we can say

it's doing things that we can sort of recognize it's looking at cat outlines without the

background it's trying to pull the cat out of the background it's doing things where we can

imagine describing in words what's going on

and in fact many of the things it's doing are things that we know from studying the

neurophysiology of brains are what the first levels of visual processing in brains actually do

by the time

we're deeper in the neural net it's much harder to tell what's going on if we go ten layers

down in the neural net then again this is in the mind of the neural net this is what it's

thinking about to try and decide is it a cat or a dog things have become much more abstract

much harder to explicitly recognize but that's sort of a representation for us of what's

happening in the kind of mind of the neural net and if we ask what's a theory for how cat

recognition works it's not clear we can have a theory in the sense of a narrative description a

simple way of describing how the thing tells that it's a cat even if you ask a human how do you

tell we say well it's got these pointy ears it's got this and that it's probably hard for a

human to describe how they do that recognition and when we look inside the neural net there's

no guarantee that there's a simple narrative for what it's doing and typically there isn't okay

so we've talked about how neural nets can successfully go from a cat image

to saying that’s a cat versus that’s a dog how do you set the neural net up to do that

so the way we normally write programs is we think about how the program should work what should

it do should it first take the image of the cat and figure out what's the shape of its ears

does it have whiskers all these kinds of things that's the typical engineering way to make a

program and that's what people did 15 or 20 years ago in trying to recognize images of things

the typical approach was to try and recognize human-explainable features of images as a way to

recognize things the big idea of machine learning is you don't have

to do that instead you can just give a bunch of examples where you say this is a cat this is a

dog and have a system which can learn from those examples we just have to give it enough

examples and then when you show it a new cat image it's never seen before it'll correctly say

that's a cat versus that's a dog so let's talk about

how that's actually done what we're interested in is can we take one of those neural nets i

showed the ones with all these weights where as you change the weights you change the function

the neural net is computing let's say you have a neural net and you want to make it compute a

particular function let's take a very simple case let's say

we have a neural net we just want it to compute as a function of x we want it to compute this

particular function here okay so let's pick a neural net there's a neural net without weights

let's fill in random weights for every random collection of weights the neural net will compute

something it won't be the function we want but it'll always compute something it'll always be

the case that when you feed in some value up here you'll get out some value down here and these

are plots of the functions you get by doing that okay the big idea is that if you do it the

right way and you can give enough examples of what function you are trying to learn you will be

able to progressively tweak the weights in this neural net so that eventually you'll get a

neural net that correctly computes

this function so again what we're describing here is this is the value x up here and this is g

of x down here for some function g and the function g that we want is this kind of

square-wave-type thing here

now in this particular case this neural net with these weights is not computing the function we

wanted it’s computing this function here but as we progressively train this neural net we tweak

the weights until eventually we get a neural net that actually computes the function we want

in this particular case it took 10 million examples to get to the point where we have the neural net

that we want okay so how does this actually work how is it actually done as i said at the

beginning we started off with a neural net with random weights and with random weights this

function x to g of x with that particular choice of weights is this thing here which isn't even

close to what we wanted so when we have examples of results how do we go from those to train

the neural net essentially what we do is we say we've got this neural net let's pick a value of

x 0.2 for example

let’s run it through the neural net let’s see what value we get okay we get this value here

and we say that value is not correct based on the training data that we have based on the

function we're trying to train the neural net to generate it isn't the correct result it should

have been let's say a minus one and it was in fact 0.7 or something okay so then the idea is

that knowing that we got it wrong

we can measure how much we got it wrong and we can do that for many different samples we can

take let's say a thousand examples of this mapping from the value x to the function g of x that

the neural net computes and ask of those thousand examples how far off were they we can compute

what's often called the loss which is to take all those values of what we should have got

versus what we actually got and for example take the sum of the squares of the differences

between those values that gives us a measure where if all the values were exactly right the

loss would be zero
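in code that sum-of-squares loss is one line (the numbers in the example reuse the minus-one-versus-0.7 case from a moment ago):

```python
def loss(predicted, target):
    # sum of squared differences: zero only when every value is exactly right
    return sum((p - t) ** 2 for p, t in zip(predicted, target))

print(loss([1.0, -1.0], [1.0, -1.0]))  # 0.0: every value correct
print(loss([0.7], [-1.0]))             # (0.7 - (-1.0))**2, about 2.89
```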

but in fact it's not zero because we didn't actually get the right answers with our neural net

so then what we're trying to do is progressively reduce that loss progressively tweak the

neural net so that the loss goes down this is what it typically looks like this is the loss as

a function of the number of examples you've shown and what you see is that as you show more and

more examples the loss progressively decreases reflecting the fact that the function being

computed by the neural net is getting closer to the function we actually wanted eventually the

loss is really quite small and the function computed by the neural net is really close to the

one we wanted that's the idea of training a neural net we're trying to tweak the weights to

reduce the loss to get to where we want okay so let's say we've got a neural net

with a particular set of weights we compute the loss and the loss is really bad we're pretty

far away how do we arrange to incrementally get closer to the right answer well we have to

tweak the weights but what direction do we tweak the weights in okay so this is a tricky thing

that got figured out well in the 1980s for neural nets i should say that the idea of neural

nets originated in 1943 Warren McCulloch

and Walter Pitts were the two guys who wrote the original paper that described these idealized

neural nets and what's inside chat gpt is basically a big version of what was described in 1943

there was a long history of people doing things with just one layer of neurons and that didn't

work very well then in the early 1980s there started to be some knowledge of how to deal with

more layers and then when gpus started to exist and computers got faster there was a big

breakthrough around 2012 where it became possible to deal with training and using deep neural

nets by the way for people who are interested i did a discussion with a friend of mine named

Terry Sejnowski who's been involved with neural nets for about 45 years now and has been quite

instrumental in many of the developments that have happened that discussion was live streamed a

few days ago and you can find it on the web if you're interested in that history but back to

how these

things work we found the loss is bad and we want to reduce it how do we reduce the loss we need

to tweak the weights but in what direction well this turns out to be a big application of

calculus because basically our neural net corresponds to a function of the weights when we

compute the loss we are working out the value of this neural net function for lots of values of

x and y and so on and the thing we're computing is a big complicated algebraic formula that we

can think of as a function of all those weights so how do we make things better how do we tweak

the weights to reduce this overall loss quantity well we can use calculus we can think of the

loss as a surface as a function of all of these weights and we want to minimize that function

so for example in a very simplified case we

might have the loss as a function of just two weights the neural nets I was just showing had I

don't know 15 weights or so in the real example of an image recognition network it might be

40,000 weights and in chat gpt it's 175 billion weights but here we're just looking at two

weights and asking if this was the loss as a function of the values of those weights how would

we find the minimum how would we find the best values of those weights so a typical procedure

is to use so-called gradient

descent basically you say I'm at this position on this loss surface a surface whose coordinates

are the weights and I want to get to a lower point on this loss surface by changing the weights

always following the gradient vector down the hill the direction of steepest descent you use

calculus to work out the derivatives at this point as a function of these weights and you go

down the hill as fast as you can okay so that's how you try to minimize the loss by tweaking

the weights so that you follow this gradient descent procedure to get to the minimum now

there's a bit of a bug with this because the surface

to to get to the minimum now there’s a there’s a bit of a bug with this because the surface

that corresponds to all the weights it might have as this picture shows it might have more than one

minimum and actually these minima might not be all at the same height so for example if you’re

on a mountainscape there might be a mountain lake there might be a very high altitude mountain lake

and all of the water that’s kind of following steepest descent down to get to the minimum

only manages to get to that high altitude mountain lake even though there’s a low altitude mountain

lake that will be a much lower value of the loss so to speak that isn’t reached by this gradient

descent method it’s never you you get stuck in a local minimum you never reach the more global

minimum and that’s kind of what what potentially happens in neural nets is you can be okay i’m

going to reduce the loss i’m going to tweak the weights but whoops i can’t really get very far i

can’t reduce the loss enough to be able to successfully reproduce my function with my

neural net or whatever i can’t tweak the weights enough because i got stuck in a local minimum
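a toy version of the descent loop, on a made-up loss surface over two weights (a simple bowl with a single minimum, so this one happens not to get stuck; on a bumpier surface the same loop can be trapped in a local minimum):

```python
def gradient_descent(grad, w, rate=0.1, steps=200):
    # repeatedly step the weights a little way in the direction
    # of steepest descent on the loss surface
    for _ in range(steps):
        g = grad(w)
        w = [wi - rate * gi for wi, gi in zip(w, g)]
    return w

# made-up loss surface: loss(w0, w1) = (w0 - 2)**2 + (w1 + 1)**2
def grad(w):
    return [2 * (w[0] - 2), 2 * (w[1] + 1)]

print(gradient_descent(grad, [5.0, 5.0]))  # heads to the minimum at (2, -1)
```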

i don't know how to get out of that local minimum so the big breakthrough and surprise of 2012

in the development of neural nets was the following discovery you might have thought that you'd

have the best chance of getting a neural net to work well when it was a simple neural net you

could get your arms around it and figure out all these weights and do all these calculations

but actually it turns out things get easier when the neural net and the problem it's trying to

solve get more complicated roughly the intuition seems to be this although nobody i think

expected it i certainly didn't and it's sort of obvious after the fact the issue is are you

going to get stuck

as you try and follow this gradient descent well if you're in a low dimensional space it's

quite easy to get stuck you just get into one of these mountain lakes and you can't go any

further but in a high dimensional space there are many different directions you could go and

chances are from any local minimum you get to you'll be able to escape because there'll always

be some dimension some direction that allows you to escape it's not totally obvious it would

work that way but that's what seems to happen in these neural nets when you have a complicated

enough neural net there's always a way to escape always a way to reduce the loss okay so that's

this idea of tweaking the weights

to reduce the loss that's what's going on in all neural nets there are different schemes for

how you do the gradient descent and how big the steps are and all kinds of other things and

there are different ways you can calculate the loss when we're doing it for language we're

comparing probabilities of sequences of words from the model against what we actually see in

the data as opposed to just distances between numbers but it's the same basic idea
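pulling the pieces together at toy scale: a single linear "neuron" w*x + b, the squared loss over a set of examples, and gradient-descent tweaks to the weights (the target function y = 2x + 1 is made up for the illustration):

```python
import random

def train(examples, rate=0.01, steps=2000):
    # start from random weights, then repeatedly tweak w and b downhill
    # on the summed squared loss over all the training examples
    w, b = random.random(), random.random()
    for _ in range(steps):
        gw = sum(2 * (w * x + b - y) * x for x, y in examples)
        gb = sum(2 * (w * x + b - y) for x, y in examples)
        w -= rate * gw
        b -= rate * gb
    return w, b

examples = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]
w, b = train(examples)
print(w, b)  # very close to the w = 2, b = 1 that generated the examples
```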

now every time we run one of

these neural nets we do all this tweaking of weights and so on and we get something where yes

we got a neural net that reproduces the thing we want these are results from four different

neural nets that all pretty much successfully reproduce this function now you might

ask well what happens if i go outside the range where i had explicitly trained the neural net i

told it that my function x goes to g of x for this range here the one in white but then i say

well i've got the neural net now let me try running it for a value of x that i never trained it

on what's it going to give well that will depend on which particular minimum which weight

tweaking etc it went to so when the neural net tries to figure out things it wasn't explicitly

trained on it's going to give completely different answers depending on the details of how it

happened to get trained it knows the things it's already seen examples of and it's constrained

to basically reproduce those examples but when you're dealing with things that are out of the

box it might think differently out of the box so to speak depending on the details of that

neural net all right so

this whole question about training neural nets is a giant modern art so to speak and

particularly over the last decade there's been increasingly elaborate knowledge of that art of

training neural nets a certain amount of lore about how these neural nets should get trained

has developed so what's in that lore well kind

of the first question is what kind of neural net architecture should you use how many neurons

how many at each layer how should they be connected together and there have been a number of

observations in the

art of neural nets that have emerged what was believed at the beginning was that for every

different task you want a neural net to do you would need a different architecture you would

somehow optimize the architecture for each task it's turned out that that isn't the case it's

much more that there are generic neural net architectures that seem to work across a lot of

different tasks and you might say isn't that just like what happens with computers and

universal computers you need only uh you can run different software on the same computer same

hardware different software that was the kind of idea from the 1930s that launched the whole

computer revolution the whole notion of software and so on is this a repetition of that i don’t

actually think so. i think this is actually something slightly different. i think the reason a small number of architectures cover a lot of the tasks neural nets can do is that the tasks neural nets can do are tasks that we humans are also pretty good at doing, and these neural nets are reproducing something about the way we humans do tasks. so as long as the tasks you're asking the neural net to do are sort of human-like, any human-like neural net is going to be able to do them. now there are other tasks, different kinds of computations, that neural nets and humans are both pretty bad at doing, and those will be outside this zone where it doesn't really matter what

architecture you have. well, okay, so there are all kinds of other things that people have wondered about, like: instead of making these very simple neurons, just like the ones from 1943, let's make more complicated assemblies of things, and let's put more detail into the internal operations of the neural net. it turns out most of that stuff doesn't seem to matter, and i think that's unsurprising from a lot of science that i've done not specifically related to neural nets. now when it comes to neural nets and sort of how they're

architected, there are a few features of the data that you're looking at with the neural net that it seems useful to capture in the actual architecture of the neural net. it's probably not ultimately completely necessary: it's probably the case that you could use a much more generic neural net and, with enough training, enough tweaking from the actual data, you'd be able to learn all these things. but for example if you've got a neural net that's dealing with

images, it is useful to initially arrange the neurons in an array that's like the pixels. so this is the representation for the particular network called LeNet that we were showing for digit recognition: there's a first layer of neurons here, then it thickens up into multiple different copies of the image, which we saw when we were looking at those pictures, and then it keeps going and eventually it rearranges things. one thing to understand about neural nets is that neural nets take everything

they're dealing with and grind it up into numbers. computers take everything they're dealing with and eventually grind it up into zeros and ones, into bits. neural nets right now are grinding things up into arbitrary numbers, you know, 3.72; they're real numbers, not necessarily just zeros and ones. it's not clear how important that is. it is necessary when you're going to incrementally improve weights and use calculus-like things to do that; you need these continuous numbers to be able to do that. but in any case, whether you're showing the neural net a picture, a piece of text, whatever, in the end it's got to be represented in terms of numbers. and how those numbers are arranged matters: for example, here there's an array of numbers arranged in the pixel positions, and the whole array is reconstituted and rearranged and flattened, and in the end you're going to get probabilities for each of the 10 digits, which will be just a sequence of numbers, a rearranged collection

of numbers. okay, so we're talking about how complicated a neural net you need to perform a particular task. it's sometimes pretty hard to estimate that, because you don't really know how hard the task is. let's say you want a neural net that plays a game. well, you can compute the complete game tree for the game, all the possible sequences of play that could occur; it might be some absolutely huge game tree. but if you want to get human-level play for that game, you don't need to reproduce that whole game tree. if you were going to do a very systematic computation and just play the game by looking at all the possibilities, you'd need to be able to go through that whole game tree; but if you're trying to achieve human-like performance, the humans might have found some heuristic that dramatically simplifies things, and you might need just some much simpler neural net. so this is an example of

how, if the neural net is way too simple, it doesn't have the ability to reproduce, in this case, the function we wanted; but you'll see that as the neural nets get a bit more complicated, we eventually get to the point where we can indeed reproduce the function we wanted
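to make that concrete, here's a minimal sketch in python, not the actual demo from the talk: the target function f(x) = x², the layer size, and the learning rate are all invented for illustration. a tiny one-hidden-layer net is trained by plain gradient descent, and its loss ends up far below where it started, i.e. the net has become rich enough to reproduce the function reasonably well.

```python
import math, random

# one-hidden-layer net: y = sum_j v[j]*tanh(w[j]*x + b[j]) + c,
# trained by plain gradient descent on mean squared error.
random.seed(0)
H = 8                                      # hidden neurons (arbitrary choice)
w = [random.uniform(-1, 1) for _ in range(H)]
b = [random.uniform(-1, 1) for _ in range(H)]
v = [random.uniform(-1, 1) for _ in range(H)]
c = 0.0

xs = [i / 20 for i in range(-20, 21)]      # training points on [-1, 1]
ts = [x * x for x in xs]                   # invented target function f(x) = x^2

def predict(x):
    return sum(v[j] * math.tanh(w[j] * x + b[j]) for j in range(H)) + c

def mse():
    return sum((predict(x) - t) ** 2 for x, t in zip(xs, ts)) / len(xs)

lr = 0.05
initial_loss = mse()
for step in range(3000):
    # accumulate gradients of the loss over the whole training set
    gw, gb, gv, gc = [0.0] * H, [0.0] * H, [0.0] * H, 0.0
    for x, t in zip(xs, ts):
        h = [math.tanh(w[j] * x + b[j]) for j in range(H)]
        err = 2 * (sum(v[j] * h[j] for j in range(H)) + c - t) / len(xs)
        for j in range(H):
            gv[j] += err * h[j]
            gw[j] += err * v[j] * (1 - h[j] ** 2) * x
            gb[j] += err * v[j] * (1 - h[j] ** 2)
        gc += err
    for j in range(H):
        w[j] -= lr * gw[j]; b[j] -= lr * gb[j]; v[j] -= lr * gv[j]
    c -= lr * gc
final_loss = mse()
```

with far fewer hidden neurons the same setup would get stuck well above this; that's the "too simple to reproduce the function" regime.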

all right, so you can ask: are there theorems about what functions you can reproduce with what neural nets? basically, as soon as you have any neurons in the middle you can, at least in principle, reproduce any function, but you might need an extremely large number of neurons to do it. it's also the case that that neural net might not be trainable: it might not be that you can find, for example, a gradient that always makes the loss go down just by tweaking weights; it might be that you couldn't incrementally get to that result. well, okay, so let's say you've decided on the architecture of your neural net and now you want to train it. the next big thing is you have to have the data to train your neural net from

and there are two basic categories of training that one does for neural nets: supervised learning and unsupervised learning. in supervised learning you give the neural net a bunch of examples of what you want it to learn. so you might say: here are 10,000 pictures of cats and 10,000 pictures of dogs; the pictures of cats are all tagged "this is a picture of a cat", the dogs "this is a picture of a dog", and you're feeding the neural net these explicit examples of what you want it to learn. now that's what one has to do for

many forms of machine learning, and it can be non-trivial to get the data. often there are sources of data where you're sort of piggybacking on something else. you might get images from the web, and they might have alt tags, text describing the image, and that's how you might be able to associate the description of the image, the fact that this is a cat, with the actual image. or if you're doing audio kinds of things, you might say: let's get a bunch of videos which have closed captions, and that will give us the supervised information, here's the audio, here's the text that corresponds with that audio, that's what we have to learn. so that's sort of one

style of teaching neural nets: supervised learning, where you've got data which explicitly gives examples of "here's the input you're going to get, here's the output you're supposed to give". and that's great when you can get it. sometimes it's very difficult to get the necessary data to train the machine learning system, and when people say, oh, can you use machine learning for this task, well, if there's no training data the answer is probably going to be no, unless you can get a sort of proxy for that task from somewhere else, or you just have to blindly hope that something that was transferred from some other domain might work, just as when you're doing mathematical models you might say, well, linear models worked in these places, maybe we can blindly hope they'll work here. that doesn't tend to work that well. okay, before getting to the other form of learning, i should explain another thing

about neural nets that's kind of important, something that's been very critical over the last decade or so: the notion of transfer learning. once you've learned a certain amount with a neural net, being able to transfer that learning to a new neural net, to give it a kind of head start, is very important. that transfer might be: the first neural net learnt the most important features to pick out in an image; let's feed the second neural net those most important features and let it go on from there. or it might be something where you're using one neural net to provide training data for another neural net, making them compete against each other, and a variety of other things like that; those actually have different names, and the term transfer learning mostly refers to the first thing i was talking about. okay, so there are issues about how do you get enough

training data, and how many times do you show the same example to a neural net. it's probably a little bit like humans: for us, when we memorize things, it's often useful to go back and rethink about the exact same example we were trying to memorize before; so it is with neural nets. and there are also questions like: you've got an image of a cat that looks like this; maybe you can get the equivalent of another image of a cat just by doing some simple image processing on the first cat. it turns out that that notion of data augmentation seems to work surprisingly well: even fairly simple transformations are almost as good as new data in terms of providing more examples. well, okay, the

other big learning methodology that one tends to use is unsupervised learning, where you don't have to explicitly give "here's the input, here's the example output" pairs. for example, in the case of something like ChatGPT there's a wonderful trick you can use. let's say ChatGPT's mission is to continue a piece of text. okay, how do you train it? well, you've just got a whole bunch of text, and you say: okay, network, here's the text up to this point; let's mask out the text after that point; can you learn to predict what happens if you take off the mask? you don't have to explicitly give it input-output pairs; you're implicitly

able to get that just from the original data you've been provided. so essentially what's happening when you're training the neural net of ChatGPT is you're saying: here's all this english text, from billions of web pages; now look at the text up to this point, and can you correctly predict what text will come later? if it gets it wrong, that means there's some loss associated with that; let's see if we can tweak the weights in the neural net to get it closer to correctly predicting what's going to come next. so in any case, the end result of all of this is you make a neural net. i could show you neural net training; it's very easy to train neural nets. oh,

let's see, let's just do one. so here's a collection of handwritten digits, maybe 50,000 handwritten digits. there we go. this is a supervised training story: here are all the zeros, each tagged as a zero; those are the nines, each tagged as a nine. okay, so let's take a random sample of, i don't know, 2,000 of those, and now we're going to use that. okay, there's our random sample of 2,000 handwritten digits and what each was supposed to be. okay, so let's get a neural net

let's try taking this LeNet neural net; this is now an untrained neural net. and now we should be able to just say: train that neural net with this data (there's the data there, on line 32). let's say: train this. and what's going to happen is it's showing us the loss, and as it's being presented with more and more of those examples, and being shown the same example many many times, you'll see the loss is going down and it's gradually learning
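that same supervised story can be sketched in a few lines of python (an invented toy, not the LeNet demo): a single sigmoid "neuron" learning to separate two labelled clusters of 2-d points, with the loss going down as the weights get tweaked.

```python
import math

# invented training data: two labelled clusters of 2-d points
data = []  # (x1, x2, label)
for dx in (-0.4, 0.0, 0.4):
    for dy in (-0.4, 0.0, 0.4):
        data.append((-1 + dx, -1 + dy, 0))   # cluster tagged "0"
        data.append(( 1 + dx,  1 + dy, 1))   # cluster tagged "1"

w1 = w2 = b = 0.0                            # the neuron's weights

def predict(x1, x2):
    # sigmoid of a weighted sum: a number between 0 and 1
    return 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + b)))

def loss():
    # cross-entropy: the usual "how wrong are we" measure
    return -sum(y * math.log(predict(x1, x2)) +
                (1 - y) * math.log(1 - predict(x1, x2))
                for x1, x2, y in data) / len(data)

losses = [loss()]
for step in range(200):
    g1 = g2 = gb = 0.0
    for x1, x2, y in data:
        err = predict(x1, x2) - y            # gradient of the loss
        g1 += err * x1; g2 += err * x2; gb += err
    w1 -= 0.1 * g1 / len(data)
    w2 -= 0.1 * g2 / len(data)
    b  -= 0.1 * gb / len(data)
    losses.append(loss())                    # watch the loss go down

accuracy = sum((predict(x1, x2) > 0.5) == (y == 1)
               for x1, x2, y in data) / len(data)
```

the `losses` list here plays the role of the loss curve in the demo: it starts at log 2 (pure guessing) and drops as training proceeds.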

okay, now we have a trained neural net, and we can go back to our original collection of digits. let's close that up, go back to our original collection of digits, and pick another random sample, say five examples there. oh, i shouldn't have told it to do that; okay, there we go. so now we can take this trained neural net and feed it that particular nine there. now remember, we only trained it on 2,000 examples, so it didn't have very much training, but okay: it successfully told us it was a nine. that's kind of what it looks like to train; this is the Wolfram Language version of training a neural net. this was a super simple neural net with only 2,000 examples, but that's kind of what it looks like to do that training. okay, so the thing with ChatGPT

is that, well, we can keep going and talk about its training, but before we get there we need to talk about one more thing: the question of how you represent things like words with numbers. so let's say we've got all these words. we could just number every word in english; we could say apple is 75, pear is 43, etc. but there are more useful ways to label words in english by numbers, and the more

useful way is to get collections of numbers that have the property that words with nearby meanings have nearby collections of numbers. it's as if we're placing every word somewhere in a meaning space, and we're trying to set it up so that if two words are nearby in meaning space, they mean close to the same thing. so here, for example, is a collection of words laid out in one of these meaning spaces. actual meaning spaces like the one used by ChatGPT are probably 12,000-dimensional or so; this one here is just two-dimensional. we're just putting things like dog and cat, alligator and crocodile, and then a bunch of fruits

here. and the main thing to notice is that things with similar meanings, like alligator and crocodile, wind up nearby in this meaning space, and peach and apricot wind up nearby in meaning space. so in other words, we're representing these words by collections of numbers, in this case just pairs of numbers, coordinates, which have the property that those coordinates are some kind of representation of the meaning of these words. and we can do the same thing when it comes to images
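that two-dimensional meaning space for words can be made concrete in a few lines of python. the coordinates here are invented just for illustration; real embeddings are learned, not hand-placed, and have thousands of dimensions.

```python
import math

# invented 2-d "meaning space" coordinates for a few words
position = {
    "dog":       (1.0, 1.2),
    "cat":       (1.3, 1.0),
    "alligator": (4.0, 3.8),
    "crocodile": (4.2, 4.0),
    "peach":     (8.0, 0.9),
    "apricot":   (8.3, 1.1),
}

def distance(a, b):
    # ordinary euclidean distance between two word positions
    (x1, y1), (x2, y2) = position[a], position[b]
    return math.hypot(x2 - x1, y2 - y1)

# the word nearest to "alligator" in this space
nearest_to_alligator = min((w for w in position if w != "alligator"),
                           key=lambda w: distance("alligator", w))
```

with these made-up coordinates, `nearest_to_alligator` comes out as "crocodile", and peach sits much closer to apricot than to dog, which is exactly the nearby-meanings-get-nearby-numbers property.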

for example, that's exactly what we had when we were looking at a picture like this: we're laying out different handwritten digits in a kind of meaning-of-the-handwritten-digit space, where the ones that mean one were over here, the ones that mean three were over here, and so on. so a question is: how do you actually generate the coordinates, the so-called embeddings, of things, so that when things are nearby in meaning they will have nearby coordinates? okay, there are a number of neat tricks used to do this. a

typical kind of setup is this (and this is just a representation of the neural net we use to recognize digits; it has these multiple layers, and there's a little Wolfram Language representation of that). what does this network actually do? in the end, it's taking that collection of pixels at the beginning and computing probabilities. it's going to produce a collection of numbers at the end, because remember, neural nets only ever deal with collections of numbers. there'll be 10 numbers here, where each position is the probability that the thing the neural net was shown corresponded to a zero or a one or a two or a three or a four. and what you see here is that the numbers are absurdly small except in the case of four, so we can immediately deduce: okay, that image was supposed to be a four. so the output of the neural net is this collection of probabilities, where in this particular case it was really certain that the thing is a four. so that's what we deduce
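that kind of probability vector comes from exponentiating and normalizing the net's final scores (a softmax, which the next part discusses). here's a sketch with made-up scores, one per digit:

```python
import math

def softmax(scores):
    # exponentiate and normalize: results are positive and sum to 1,
    # and the biggest score gets pulled far away from the rest
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# invented final-layer scores for digits 0..9; the "4" slot is largest
scores = [1.2, -0.3, 0.1, 0.4, 6.0, 0.0, -1.1, 0.2, 0.5, -0.4]
probs = softmax(scores)
best = max(range(10), key=lambda i: probs[i])   # the digit we deduce
```

even though the raw score for 4 is only a few units above the others, after softmax its probability dwarfs the rest, which is why the output looks "absurdly small except in the case of four".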

now, the thing we can do is back up one layer in the neural net. before we get to that final output, there's a layer, i think it's a softmax layer, at the end that's trying to force the decision: it's trying to exponentially pull apart these numbers so that the big number gets bigger and the small numbers get smaller. but one layer before, before things have been torn apart to make a decision, those numbers are much more sober in size, and the numbers at this layer give some pretty decent indication of the "fourness" of what we're seeing; they have more information about what the thing that was shown actually is. we can think of these numbers as giving some kind of signature, some kind of trace, of what kind of thing we were seeing. this is specifying, in some sense, features of what we were seeing: later on we'll just decide "that's a four", but all these subsidiary numbers are already useful. so we can define these feature

vectors: this is a feature vector representing that image there, that's the feature vector representing this image here, and we see that, yes, the vectors for different fours will be a little bit different, but they're dramatically different between a four and an eight. so we can use these vectors to represent the important aspects of this four here, for instance. and if we go back a couple more layers in the neural net, it turns out we can get an array of something like 500 numbers that is a pretty good feature signature of any of these images. we can do the same thing for pictures of cats and dogs: we can get this kind of feature vector associated with what is important about the image, and then we can take those feature vectors and lay things out according to different values in them, and we'll get this kind of embedding in what we can think of as a meaning space. in the case of words, if we look

at how we do that for words, the idea is just like getting a feature vector for images: we have a task, like recognizing digits, we train a neural net to do that task, and then we back up from the final answer. we nailed the task, and we say: what was the thing that was there just before you managed to nail the task? that's our representation of the relevant features of the thing. well, you can do the same thing for words. for example, if we say "the ___ cat", we can ask: in our training data, what is that blank likely to be? is it black, is it white, whatever else? we could make a network that predicts the probabilities for that intermediate word; we can train a network to be good at predicting the probabilities of blackness versus whiteness versus tabbyness, or whatever it is, for cats. and once we've got that, we can back up from the final answer, look at the innards of the network, and see what it had done as it got towards coming up with that final result. the thing we get a little bit before it reaches the final result will be a good representation of features that were important about those words, and that's how we can deduce these sort of feature vectors for words
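a crude stand-in for this can be sketched in python (the toy corpus is invented, and real systems learn these features inside a neural net rather than counting): describe each word by the counts of the words that appear next to it, then compare words by the angle between those count vectors.

```python
import math
from collections import Counter

# an invented toy corpus
corpus = ("the black cat sat . the white cat sat . the black dog ran . "
          "the white dog ran . the red chair stood . the old chair stood .").split()

# for each word, count the words one position to either side of it
context = {w: Counter() for w in set(corpus)}
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            context[w][corpus[j]] += 1

def cosine(a, b):
    # cosine similarity between two context-count vectors:
    # 1 means identical contexts, 0 means no shared contexts
    ca, cb = context[a], context[b]
    dot = sum(ca[k] * cb[k] for k in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb)
```

in this toy corpus, "cat" and "dog" share contexts (black, white) while "cat" and "chair" share none, so the cat vector comes out closer to the dog vector than to the chair vector, which is the effect described next for GPT-2's feature vectors.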

so in the case of GPT-2, for example, we can compute those feature vectors. they're extremely uninformative when we look at them in full; what is more informative is to project them down to a smaller number of dimensions, and then we'll discover that the cat one is closer to the dog one than it is to the chair one. so what ChatGPT is doing when it deals with words is that it's always representing them using these feature vectors, using this kind of embedding that turns them into collections of numbers with the property that nearby words have similar representations. actually, i'm getting a little bit ahead of myself there, because the way ChatGPT works, it uses these kinds of embeddings, but it does so for whole chunks of text rather than for individual words; we'll get there. okay,

so i think we're getting on fairly well here. how about the actuality of ChatGPT? well, it's a big neural net: millions of neurons, 175 billion connections between them. what is its basic architecture? the big idea actually came out of language-translation networks, where the task was "start from english, end up with french", or whatever else: the idea of what are called transformers, an architecture of neural nets. there were more complicated architectures used before; this is actually a simpler one. the notion is, as i mentioned, that when one's dealing with images it's convenient to have the neurons attached to pixels, or at least laid out in a which-pixel-is-next-to-which-pixel kind of way; the so-called convolutional neural nets, or conv nets, are the typical

things that are used there. in the case of language, what transformers deal with is the fact that language is a sequence. with a conv net for an image, one is saying: there's this pixel here; what's happening in the neighboring pixels? in a transformer, what one's doing is saying: here's a word; let's look at the words that came before this word, and in particular, let's pay attention differently to different ones of those words. this gets quite elaborate as engineering quite quickly, and it's very typical of a sophisticated engineering system that there's lots of detail; i'm not going to go into much of it. this is, in a sense, a piece of the front end. so remember what ChatGPT is ultimately doing: it's a neural net whose goal is to continue a piece of text

so it's going to essentially ingest the text so far, reading in each token of the text. the tokens are either words or pieces of words; things like the "ing" at the end of a word might be a separate token. they're convenient pieces of words, and there are about 50,000 different possible tokens. it's reading through the text, the prompt that you wrote, the text it's generated so far, all of those things, and its goal is then to continue that text. in particular, every time you run through this whole neural net it's going to give you one new token: it's going to tell you what the next token should be, or rather what the probabilities for different choices of the next token should be. so one piece of this is the

embedding part, where it's reading tokens, and, i mean, this gets into a lot of detail. so for example, let's say the sequence we were reading was "hello hello hello hello hello bye bye bye bye bye"; this is showing the resulting embeddings you get. i said before we were talking about embeddings for words; now we're talking about embeddings for whole chunks of text, and we're asking what sequence of numbers should represent that piece of text. and the way you set that up, and again, this is getting pretty deep into the entrails of the creature, what you can think of is that there are different components to this embedding vector, and

this picture is showing, across the page, the contribution from each word, and down the page, the different pieces of the feature vector that are being built up. the way it works is that it takes each word, and then the position of the word is encoded. you could just encode the position of the word as a binary number that says "this is word number seven", you know, 0111 or something, but that doesn't work as well as essentially learning this sort of random-looking collection of things, which are in effect position tags for the words. anyway, the end result is you're going to make this thing where each level is a different sort of feature associated with each of these words, and that's what's going to be fed into the next level

of the neural net. okay, so the next big piece is the so-called attention block. i don't know how much this is worth explaining; i talk about it a bit more in the piece that i wrote. but essentially, in the end it's just a great big neural net, though that neural net doesn't have every possible connection in it: it has, for example, only connections that look to places that were earlier in the text, and it is, in a sense, concentrating differently on different parts of that text. you can make a picture here of the amount of attention it is paying, and by attention i literally mean the size of the weights with which it is weighting different parts of the sequence that came in. and the way

it works, i think, for GPT-3 is: first of all it has this embedding vector, which for GPT-3 is 12,288 numbers long (it's that particular number because it's multiples of things). it's trying to put together an embedding vector to represent the text so far, in which it has had contributions from words at different positions, and it's figured out how much contribution to take from the words at each position. okay, so it does that, then it feeds the whole thing to a layer of neural net where there's a 12,000-by-12,000 array of weights, which specify, for each incoming neuron, its weight to each outgoing neuron. the result is this whole assembly of weights, which looks like nothing in particular, but these are weights that have been learnt by ChatGPT

to be useful for its task of continuing text. and you can play little games: you can try to visualize those weights by making moving averages, and you can see that the weights are roughly as if randomly chosen, but this shows you a little bit of the detail inside that randomness. in a sense you can think of this as a view into the brain of ChatGPT, showing you, at the level of these individual weights in the neural net, what its representation of human language is, right down at that level. it's kind of like taking apart a computer and looking at individual bits inside the cpu; this is the same sort of thing for the representation ChatGPT has of language. and it turns out there isn't just one of these attention layers. what happens is

that there are different blocks of the feature vector for the text that get separated out and handled differently. nobody really knows what the interpretation of those blocks is; it's just been found to be a good thing to do, to not treat the whole feature vector the same but to break it into blocks and treat the pieces differently. maybe there's an interpretation of one piece of that feature vector, that this is, i don't know, words that are about motion or something; but it won't be anything like that, nothing as human-understandable as that. it's kind of like a human genome or something: all the traits are mixed up together in the specification. it's not something where we can easily have a narrative description of what's going on

but what's been found is that you break this feature vector of features of the text up, and you have these separate attention heads, each with this sort of re-weighting process going on. and, this is where it's crazy that things like this work, you do that 96 times for ChatGPT: the same process, 96 times over. this, for GPT-2, the simpler version, is a representation of the things that come out of these attention blocks, the sort of weights that were used there. and there is some regularity; i don't know what it means, but if you look at the size of the weights, for some layers they're gaussian distributed and for some layers they're not. i have no idea what the significance of that is; it's just a feature of what ChatGPT learnt as it was trying to understand human language from the web
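that attention re-weighting can be sketched bare-bones in python (a single attention head with invented 4-number token vectors; real transformers use learned query, key, and value projection matrices and thousands of dimensions):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# one invented 4-number vector per token position
tokens = [[1.0, 0.0, 0.5, 0.2],
          [0.3, 0.8, 0.1, 0.0],
          [0.9, 0.1, 0.4, 0.3]]

def attend(seq):
    out, weights = [], []
    for i, q in enumerate(seq):
        # look only at positions up to and including i: connections
        # only to places earlier in the text
        scores = [sum(qc * kc for qc, kc in zip(q, seq[j]))
                  for j in range(i + 1)]
        w = softmax(scores)            # how much attention to pay where
        weights.append(w)
        # output = attention-weighted average of the earlier vectors
        out.append([sum(w[j] * seq[j][d] for j in range(i + 1))
                    for d in range(len(q))])
    return out, weights

outputs, weights = attend(tokens)
```

each row of `weights` is exactly the kind of picture described above: a set of positive numbers summing to one, saying how strongly this position is concentrating on each earlier part of the sequence.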

okay, so again: in the end it's just a great big neural net, and we're trying to deduce the weights for that neural net by showing it a whole bunch of text and asking: what weights do you have to have so that the continuation of the text will have the right probabilities for what word comes next? that's its goal, and i've described the outline of how that's done. in the end one has to feed it data, and the reason it's even possible to do this is that there's a lot of training data

to feed it so it’s been fed a significant fraction of what’s on the web there are maybe

i don’t know it depends how you describe this but there are maybe six billion maybe

10 billion uh kind of reasonably human written pages on the web where humans actually type that

stuff it wasn’t mostly machine generated etc etc etc that’s on the publicly visible web

that's without having programs go in and select lots of different things to see what you get; it's just the raw content of web pages. maybe there's 10, maybe 100, times as much as that if you were able to drill down into internal web pages and things like that. so you've got some number of billions of human-written pages, and there's a convenient collection called common crawl, built by starting from one web page, following all the links, collecting those pages, and continuing to follow links until you've visited all the connected parts of the web. the result is that there's a trillion words of text you can readily get from the web. there are also probably 100 million books that have been published, i think the best estimate is maybe 130 million, of which 5 or 10 million exist in digitized form, and you can use those as training data as well: that's another 100 billion or so words of text. so you've got a trillion-ish words of text, and probably much more than that if you include transcriptions of videos and things like that. for me personally, as a kind of personal estimate of these things, i realized that the things i've written over my lifetime constitute about 3 million words, the emails i've sent over the last 30 years are another 15 million words, and the total number of words i've typed is around 50 million. interestingly, in the live streams i've done just in the last couple of years, i've spoken another 10 million words. that gives a sense of what human output looks like. but the main point is that there's a trillion words available to train a neural net to do this task of continuing text. one thing to understand

about training a neural net: when we looked at those functions before, and asked how many neurons we need to represent a given function, there was also the question of how many training examples you have to give to get the neural net trained to represent that function. in those cases we didn't need very big neural nets, but we did need a lot of training examples. there have been all kinds of efforts to understand how many training examples you actually need, and how big a neural net you actually need, to do something like this text-continuation task. it's not really known, but with 175 billion weights the surprise is that chat gpt does pretty well. now you can ask: how much training does it need? how many times does it have to be shown those trillion words? what's the relationship between the trillion words and the number of weights in the network? for text, it seems that the number of weights in the network is roughly comparable to the number of training examples: you show it the training examples about once, and if you show them too many times its performance actually gets worse. that's very different from what happens when you're training on mathematical functions and things like that. i should explain, by the way, what happens every time the neural net runs: in the case

of chat gpt, you're giving it a collection of numbers that represents the text it's gotten so far. that collection of numbers is the input to the neural net, and it ripples through the net layer after layer, about 400 core layers, and at the end you get an array of numbers: probabilities for each of the roughly 50,000 possible words in english. based on that, it picks the next word. so the main operation of chat gpt is

just straight through: you've got this text so far; given that, percolate through the network and say what the next result should be. it just runs through one time. that's actually very different from the way computers tend to work for other purposes. in most non-trivial computations you take the same piece of data and compute on it over and over again; in simple models of computation, like turing machines, that's what's happening all the time, and it's what makes computers able to do the non-trivial things they do: they take a small number of pieces of data and just keep reprocessing them. in something like chat gpt, you've got this big network and you percolate through it once for every token. the only sense in which there's any feedback is that once you get an output token, you add it to the input and feed that in on the next step. so it's an outer loop: you give feedback by adding tokens to the text, that percolates through, you get another token, that percolates through, and so on. in computers, lots of non-trivial computations have lots of inner loops; quite possibly in the brain there are inner loops happening as well. but the model we have in chat gpt is this percolate-through-once kind of model, with a very complicated network, but just percolating through once.
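
the percolate-once-per-token outer loop described above can be sketched like this. the "model" here is a deterministic toy stand-in for the real 400-layer network; the vocabulary and probabilities are invented for illustration.

```python
# a sketch of the outer loop: the network runs straight through once per
# token, and the only feedback is appending the chosen token to the input.
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "."]

def fake_model(tokens):
    """stand-in for the one-pass network: returns a probability for each
    word in the vocabulary, given the text so far."""
    random.seed(len(tokens))                 # deterministic toy behaviour
    weights = [random.random() for _ in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def generate(prompt, n_tokens, temperature=0.0):
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = fake_model(tokens)           # one forward pass
        if temperature == 0.0:               # zero temperature: take the max
            idx = max(range(len(VOCAB)), key=lambda i: probs[i])
        else:                                # otherwise sample from the fan
            idx = random.choices(range(len(VOCAB)), weights=probs)[0]
        tokens.append(VOCAB[idx])            # feed the token back in
    return tokens

print(generate(["the"], 5))
```

the real system differs only in scale: the forward pass is 175 billion operations instead of a toy function, but the outer loop is this shape.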

so that's how it works. one thing that's tricky is that every time it percolates through, it has to use every single one of those weights: for every token chat gpt produces, it's essentially doing 175 billion mathematical operations, using each of those weights to compute the result. most likely that's not actually necessary, but we don't currently know how to do any better. and when you train chat gpt, working out how to change the weights based on the loss, every training step requires a kind of reverse version of that forward, so-called inference, process. it turns out the reverse process isn't that much more expensive than the forward process, but you have to do it a whole lot

of times in the training. typically, if you have a model of size n, it seems that for text you need about n squared computational effort to do the training, and n is pretty big when you're dealing with language at the scale of chat gpt, so in the training process that little mathematical square is a really big deal.
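
the scaling just described can be put into rough numbers. the figures below use the numbers mentioned here (175 billion weights, training examples comparable to the number of weights); the 2n flops per token for inference and the 6-flops-per-weight-per-token training rule are common rough estimates, not exact laws.

```python
# a back-of-the-envelope sketch of why training cost scales like n squared.
n = 175e9                                  # weights in the network

flops_per_token = 2 * n                    # one forward pass touches every weight
training_tokens = n                        # training examples comparable to weights
training_flops = 6 * n * training_tokens   # forward + backward, over all tokens

print(f"inference, per token: {flops_per_token:.1e} flops")
print(f"training, total:      {training_flops:.1e} flops")
```

the total comes out around 2 x 10^23 floating-point operations, which is why the training is a matter of enormous gpu fleets rather than a single machine.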

and it means that, with current gpus and things like this, you potentially have to think about spending hundreds of millions of dollars on the training, given the current model of how neural nets work. now, i have to say that there are a lot

of aspects of the current model that probably aren't the final model. we can plainly see big differences from, for example, what the brain manages to do. one big difference: when you're training a neural net, you have a bunch of things in memory and some computation going on, but the things in memory are mostly idle most of the time, with just a little bit of computation happening. in brains, every one of our neurons is both a place that stores memory and a place that computes. that's a different kind of setup, and we don't know how to train neural nets that way. various things have been looked at, even from the distant past: from the 1940s, people were starting to think about distributed ways to do learning in neural nets, but that's not something that has landed yet as a thing we can do. okay, in the case of chat gpt,

an important thing was this: six months or a year ago, there were early versions of the gpt family of text-completion systems, and the text they produced was only so-so. then something was done by openai with chat gpt: there was an additional step, a reinforcement-learning training step, in which essentially humans told chat gpt "go and make an essay", "be a chat bot, have a conversation with me", and the humans rated what came out: that's terrible, that's better, and so on. that little bit of poking seems to have had a very big effect; that little bit of human guidance saying "yes, you got this from the statistics of the web, but the direction you're going in is a bad direction, it's going to lead to a really boring essay" or whatever else. by the way, this reinforcement-learning step at the end is also where a lot of the complication about what the humans think the system should be producing gets injected: if the humans say "we really don't want you to talk about this, we really don't want you to talk about that", that's the place it goes in. and what you do, for example, is watch what the humans did when they poked at those essays and rated what happened, and try to machine-learn that set of human judgments. that learned model can then provide much more training data: the tweaking the humans did gets turned into another network, which can then produce the examples used to retrain, to fine-tune, the main network.
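
the three-step process just described, humans rate a few outputs, a model learns to imitate the ratings, and that model then scores unlimited further outputs, can be sketched with a toy "reward model". everything here, the ratings, the features, the texts, is invented for illustration.

```python
# toy sketch of learning a reward model from a handful of human ratings.
import numpy as np

# step 1: human ratings for a few sample outputs ("good" = 1, "terrible" = 0)
texts   = ["the sky is blue.", "blah blah blah", "water is wet.", "asdf qwer"]
ratings = [1.0, 0.0, 1.0, 0.0]

def features(t):
    # crude invented features: ends with a period, contains the word "is"
    return np.array([1.0, t.endswith("."), "is" in t], dtype=float)

# step 2: fit a linear reward model to the human ratings by least squares
X = np.stack([features(t) for t in texts])
w, *_ = np.linalg.lstsq(X, np.array(ratings), rcond=None)

# step 3: the learned model now scores new candidate outputs automatically,
# standing in for the fine-tuning signal fed back to the main network
def reward(t):
    return float(features(t) @ w)

print(reward("grass is green."), reward("blah blah blah blah"))
```

the real step uses a neural net trained on human preference comparisons rather than a linear fit, but the shape of the idea, a learned imitation of the raters that can be queried endlessly, is the same.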

so that's a thing that seems to have had a big effect on the human perception of what happens in chat gpt. the other thing that's a surprise is that you can give it these long prompts in which you tell it all kinds of things, and it will make use of that in a rather human kind of way in generating the text that comes later. okay, the big question is: how come this works? why is it that a thing with only

a hundred billion or so weights can reproduce this amazing thing that seems to require all the depth of human thinking and brains: human language. how does that manage to work? i think the key thing to realize is that what it's really telling us is a science fact: there's more regularity in human language, and in thought, than we thought there was. this thing that is human language has a lot of structure in it, and what chat gpt has done is learn a bunch of that structure, including structure we never even really noticed was there. that's what allows it to generate these plausible pieces of text. now, we do know certain kinds of structure that exist in language. for example, let me share this again: one piece of structure we know about is syntactic grammar. we know

that sentences aren't random jumbles of words: sentences are made up with nouns in particular places and verbs in particular places, and we can represent that by a parse tree, in which we say here's the whole sentence, there's a noun phrase, a verb phrase, another noun phrase, each broken down in certain ways. for a sentence to be grammatically correct, there are only certain possible forms of parse tree.
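
the noun-phrase/verb-phrase idea can be made concrete with a toy grammar. the grammar and word lists below are invented for illustration; real syntactic grammar is of course vastly larger.

```python
# a toy grammar: a sentence is a noun phrase (det + noun) followed by a
# verb phrase (verb + noun phrase). only word sequences with that shape
# get a parse tree; anything else is rejected.
NOUNS = {"cat", "dog", "bird"}
VERBS = {"sees", "chases"}
DETS  = {"the", "a"}

def parse_sentence(words):
    """return a parse tree (nested tuples) for 'det noun verb det noun',
    or None if the words don't fit the grammar."""
    if (len(words) == 5 and words[0] in DETS and words[1] in NOUNS
            and words[2] in VERBS and words[3] in DETS and words[4] in NOUNS):
        np1 = ("NP", words[0], words[1])
        np2 = ("NP", words[3], words[4])
        return ("S", np1, ("VP", words[2], np2))
    return None

print(parse_sentence("the cat sees a dog".split()))   # a valid parse tree
print(parse_sentence("cat the sees dog a".split()))   # None: not grammatical
```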

so this is a regularity of language that's been known, in general terms, for a couple of thousand years; the big effort to formally codify it came in 1956. the idea is that we can represent the syntactic grammar of language by rules that say you can put nouns together with verbs only in this way and that way. now, for any set of rules you can define, and this has been a big source of controversy in linguistics, there will always be some weird exception where people typically say this rather than that. but, much like in typical machine learning, if you're interested in the 95 percent result, then there are just rigid rules, with a few exceptions here and there. so that's one form of regularity that we know exists in language: syntactic regularity.

now, chat gpt has effectively implicitly learned this syntactic grammar. nobody ever told it that verbs and nouns go this way and that; it implicitly learned it by virtue of seeing a trillion words of text on the web, which mostly have these properties. when it asks what the typical next words are, they'll be the words that followed in the examples it saw, and those mostly follow correct grammar. to understand what's going on, we can take a simpler version of this: a very trivial grammar, a grammar of

just open and close parentheses, where something is grammatically correct if every parenthesis we open eventually closes. this is a parse tree for a parenthesis sequence, open open open close open close and so on: a representation of the parsing of that sequence of open and close parentheses. so we might ask: what would it take to train a neural net to know even this particular syntactic grammar? we made a pretty small one: a transformer net with eight attention heads and feature vectors of length 128, so a lot simpler than chat gpt. (the actual transformer is in the post i made, and you can play with it in wolfram language.) in any case, if you give that transformer this sequence here and ask what comes next, it says: okay, 54% probability that there's a close parenthesis, based on its training data, which was a randomly generated collection of grammatically correct parenthesis sequences.

it has a little bit of a goof here, because it says that with probability 0.0838 this is the end of the sequence, which would of course be grammatically incorrect, because there's no close for the open parenthesis here. if we give it something that is correctly closed, it says: okay, great, there's a 34% probability this is the end of the sequence; there were no further opens. it has another little goof here, because it says there's a 15% probability of a close parenthesis, which can't possibly be right: a close parenthesis there would have no corresponding open parenthesis, so it's not grammatically correct. but in any case this gives a sense of what it takes for one of these transformer nets. we can

look inside this transformer net and see what it took to learn this very simple grammar. chat gpt is learning the much more complicated grammar of english, and in a sense the grammar of english is probably easier to learn, because there are so many clues in the actual words about how they're grammatically put together, and so many things we humans wouldn't even notice as wrong, because they're just what we do. but in this more austere case of a mathematically defined parenthesis language, we do notice. so if we give it a bunch of open parens and ask what's

the highest-probability continuation, you'll see it does pretty well up to a point, and then it starts losing it. that's a little like what would happen with humans: we can tell, just by eye, that these are correctly closed, but it becomes more difficult to tell when we get out here, and it becomes more difficult for the network too. this is a typical feature of these neural nets: with shallow questions, where you can just see one block of things against another, it does fine; when it has to go to much greater depth, it doesn't work so well.
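
to make the depth point concrete, here is the same check done the conventional way, with a single counter, which handles any depth of nesting with no difficulty at all:

```python
# checking balanced parentheses with one counter: count up on each open
# paren, count down on each close. depth of nesting costs nothing here.
def balanced(s):
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1          # count up on open
        elif ch == ")":
            depth -= 1          # count down on close
            if depth < 0:       # a close with no matching open
                return False
    return depth == 0           # every open was eventually closed

print(balanced("(()(()))"))     # True
print(balanced("(()"))          # False: an open was never closed
print(balanced("())("))         # False: a close before its open
```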

for a regular computer that can do loops inside, it's very easy to figure out what's happening here: you effectively just count up the number of open parens and count down the number of close parens. by the way, if you try this in actual chat gpt, it will confidently assert that it has matched the parentheses, but for larger parenthesis sequences it will often be wrong. it has the exact same problem; it just fails at a slightly larger size, and that's simply a feature of this kind of thing. so, one type of regularity in language that chat gpt has learnt is syntactic grammar. there's one more type of regularity you can readily identify, and that's logic. what is logic? well,

originally, when logic was invented, by aristotle so far as we know, what aristotle did was effectively a bit like a machine-learning system: he looked at lots of examples of rhetoric, lots of example speeches people gave, and asked what forms of argument appear repeatedly. somebody might say "all men are mortal; socrates is a man; therefore socrates is mortal", that is: all x's are y; z is an x; therefore z is a y. logic takes forms of language and says: these are the patterns that repeat among the pieces of language that are meaningful sequences. and originally, in the syllogistic logic that aristotle invented, it really was very language-based: in the middle ages, people would memorize these forms of syllogism, the barbara syllogism, the celarent syllogism and so on, which were just patterns of word usage where you could substitute a different word for socrates, but it was still the same pattern, the same structure. so that's kind of

another form of regularity. when chat gpt seems to be figuring things out, part of what it's doing is syllogistic logic: it has seen a zillion examples, just as aristotle presumably saw a bunch of examples when he invented logic. it has seen a bunch of examples of this sentence following that sentence in this way, and so it will do that too when it asks what's statistically likely to happen based on the web. by the way, as logic developed, by the 1800s, when people like boole were getting into the picture and making formal logic, it was no longer just these

patterns where, boom, it's a pattern, it looks like this. it became something where you could build up many layers of structure: very complicated, deeply nested logical expressions. of course, our computers today are based on those deeply nested logical expressions. chat gpt doesn't stand a chance of decoding what's going on with one of those deeply nested, mathematical, computational-style boolean expressions, but it does well at this templated, aristotle-level structure of logic. okay, i wanted to talk for just a little bit more, and then we

should wrap up here so i can try to answer some questions, about what regularities chat gpt has discovered in this thing we do that is language, and all the thinking that goes on around language. i don't know the answer, but i have some ideas about what's going on, and i'll give a little tour. we talked about meaning space, the space in which you can arrange words by meaning, and we can see how words arrange there; these are different parts of speech. for a given word, different instances of that word may occur at different places in meaning space: this is the word crane, in different sentences, and the two obvious meanings of crane, the bird and the machine, break up in meaning space according to where they are. so we can look at the structure of meaning space. another thing

we can ask is whether meaning space is like physical space. are there parallel lines in meaning space? are there cases where we can go from place a to place b, and then transport that motion, in parallel, to new places? so we can ask, if we have analogies, whether going from woman to man and from queen to king are parallel paths in meaning space. the answer is: well, maybe a bit, not very convincingly. in physical space, this is really the question of whether space is flat. for things moving in flat space, newton's first law says that if a thing is not acted on by a force, it just keeps going in a straight line; then we have gravity, and we can represent gravity by the curvature of space. the question here is: when we

go from ear to hear, or eye to see, we're moving in a certain direction in meaning space, and the question of whether such moves support this parallel-transport idea is something like: how flat is meaning space? how much effective gravity is there in meaning space? meaning space is probably not represented in terms of the kinds of things physical space is represented in terms of, but that's a question. now, when it comes to the operation of chat gpt,

we can think about how it's moving around in meaning space. it's got its prompt, "the best thing about ai is its ability to", and that prompt is, in effect, moving around in meaning space; what chat gpt then does is continue that motion. so the question is: is there something like a semantic law of motion, an analog of the laws of motion we have in physical space, but in the meaning space of concepts and words? something where we could say: if it's moved around this way, it's got momentum in this direction in meaning space, and it's going to keep going that way? it's nothing like that simple, but the question is how to think about, how to represent, the process of going through meaning space. we can start looking at that. for example, take the different possible continuations of "the best thing about ai is its ability to": what's the next word? we can look at the fan of different directions it could go in meaning space at that point, and we can see that there's some direction it tends to go in; it's not going all the way over here, at least not with high probability.

okay, if we keep going, we can see how that fan develops as we continue the sentence. this is like our motion-in-meaning-space question, and i don't know exactly what it means yet, but this is what the trajectory in meaning space looks like as chat gpt tries to continue a sentence. the green path is the actual thing it chose, i think this is the zero-temperature case, and the gray paths are the other, lower-probability cases. so that's one view, if we want to do natural science on chat gpt and ask what it discovered about how language is put together: one possibility is that there are these semantic laws of motion that describe how you move through the space of meanings as you add words into a piece of text. a slightly different way to think about this is in terms of

what one could call semantic grammar. syntactic grammar is just about nouns, verbs, parts of speech and things of that kind; but we can ask whether there's a generalization that's more semantic, with finer gradations than just saying "it's a noun, it's a verb": something that says "that verb means motion, and when you put this noun together with that motion word, it does this". we'd have buckets of meaning that are finer gradations than parts of speech, but not necessarily individual words. is there a semantic grammar we can identify, a construction kit for putting together sentences that aren't just syntactically correct but somehow semantically correct? i strongly suspect this is possible, and it's

sort of what aristotle was going for: he even talks about semantic categories and a variety of related things, though in a way grounded in the fact that it was 2,000 years ago, before computers and before a lot of the kinds of formal things we know about now. strangely enough, the amount of work on making a semantic grammar in the last 2,000 years has been rather small. there was a bit of an effort in the 1600s, with people like leibniz and his characteristica universalis, and various others trying to make what they called philosophical languages, word-independent ways of describing meaning; and there have been more recent efforts, but they've tended to be fairly specific, based on linguistics and on the details of the structure of human language. what i think is being discovered is that there are rules beyond syntax: rules for how you put together a meaningful sentence. a meaningful sentence could be something

like "the elephant flew to the moon". does that sentence mean something? sure: we can conjure up an image of what it means. has it happened in the world? no, not so far as we know. but could it be in a story, in a fictional world? absolutely. so this semantic grammar allows you to put together things that are meaningful descriptions of the world; whether they have been realized in the world is a separate question. in any case, the thing that

is interesting to me about this is that it's something i've long thought about, because i've spent a large part of my life building a computational language, the wolfram language: a system that is an effort to represent the world computationally, so to speak. we take the things we know, about chemicals or lines or images or whatever else, have a computational representation for all those things, and have a computational language that knows how they all work: it knows how to compute the distance between two cities, all those kinds of things. so i've spent the last four decades or so finding ways to represent things in the world in this computational fashion, so that you can then compute things about them in an explicit computational way, and we've been very successful at that. in a sense, the story of modern science is a story of being able to formalize lots of kinds of things in the world, and in our computational language we're leveraging that formalization to compute how things will work.

now, one feature of computing how things work is that inevitably some of those computations are deep computations: computations that something like chat gpt can't possibly do. there's a difference between the shallow computations you can learn from examples in something like chat gpt, saying "this piece of language i saw on the web fits, statistically, in this place", fitting together puzzle pieces of language, and actually representing the world in a truly formal way so that you can compute things about how it works. it's like the time before people had this idea of formalism, maybe 400 years ago or more: everything anybody figured out, they figured out by thinking in terms of language, in terms of words, in terms of immediate human thinking. what came in with mathematical science at first, and then computation, was the idea of formalizing things and getting much deeper ways to deduce what happens. and the thing i figured out, 30 or 40 years ago now, was the phenomenon of computational irreducibility: there really are things in the world where, to compute what's going to happen, you have no choice but to follow all the computational steps; you can't just jump to the end and say "i know what's going to happen"; it's not a shallow kind

of thing and so you know when we look at something like chat gpt there are certain kinds of things

it can do by sort of matching together matching these pieces of language there are other kinds

of things it’s not going to be able to do it’s not going to be able to do

not going to be able to do sort of the mathematical computation the the kind of the the thing which

requires an actual computational representation of the world for those things like us humans

it's a use-tools type of situation. And very conveniently, our Wolfram|Alpha system, which is used in a bunch of intelligent assistants and so on, has this feature that it's using our Wolfram Language computational language underneath, but it actually takes natural language input. So it's able to take the natural language that is produced by ChatGPT, for example, turn that into computational language, do a computation, work out the result, get the right answer, feed that back to ChatGPT, and then it can talk sense, so to speak.
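The loop just described can be sketched in a few lines. To be clear, this is a toy illustration with invented stand-ins for both sides, not the real ChatGPT or Wolfram|Alpha APIs: the point is only the shape of the round trip, where the language model delegates the exact computation and then wraps the result in language.

```python
# Toy sketch of the ChatGPT + Wolfram|Alpha round trip described above.
# Both "sides" here are invented stand-ins, not real APIs.

def toy_llm_draft(question):
    # Stand-in for the LLM: real ChatGPT writes fluent prose; this stub
    # just marks the spot where an exact computation is needed.
    return f"The answer to {question} is [COMPUTE: {question}]."

def toy_computational_engine(query):
    # Stand-in for Wolfram|Alpha: it only knows exponentiation here, but
    # it computes the result exactly rather than statistically.
    base, exponent = query.split("^")
    return str(int(base) ** int(exponent))

def answer(question):
    draft = toy_llm_draft(question)
    tag = f"[COMPUTE: {question}]"
    # Feed the computed result back into the language the LLM produced.
    return draft.replace(tag, toy_computational_engine(question))

print(answer("2^73"))
```

A statistical model would tend to produce a plausible-looking but wrong digit string for 2^73; the engine side gets it exactly right every time.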

Rather than just following the statistics of words on the web, you can get the best of both worlds: something where you have this flow of language, as well as this depth of computation, by having ChatGPT use Wolfram|Alpha as a tool. I wrote a bunch of stuff about that, and all kinds of things are happening with that.

But talking about what ChatGPT discovered: I think the thing it discovered is that there is a semantic grammar to a lot of things. There is a way to represent, using computational primitives, lots of the things that we talk about in text. In our computational language we've got representations of lots of kinds of things, whether it's foods or chemicals or stars or whatever else. But when it comes to something like "I'm going to eat a piece of chocolate," we have a great representation of the piece of chocolate: we know all its nutrition properties, we know everything about it. But we don't have a good representation yet of the "I'm going to eat" part.
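As a very rough illustration of the gap, here is what symbolic representations of the two halves might look like. The names here are invented for this sketch, not actual Wolfram Language constructs, and the nutrition number is purely illustrative: the point is that the "piece of chocolate" side is easy to formalize, while the "I'm going to eat" side is the part that needs a semantic grammar.

```python
# Toy sketch: the food is easy to represent symbolically; the "eat" event
# is the part that a semantic grammar would need to formalize.
from dataclasses import dataclass

@dataclass
class Food:
    name: str
    calories_per_100g: float  # illustrative nutrition property

@dataclass
class EatEvent:
    # A candidate "semantic grammar" primitive for the act of eating.
    agent: str
    food: Food
    grams: float

    def calories(self) -> float:
        return self.food.calories_per_100g * self.grams / 100

chocolate = Food("chocolate", 546.0)     # the well-represented side
event = EatEvent("me", chocolate, 25.0)  # the side we don't yet represent well
print(event.calories())
```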

What I think ChatGPT has shown us is that it's very plausible to get this semantic grammar: to have these pieces for representing these lumps of meaning in language. I think that's what's going to happen, and I've been interested in doing this for a long time; I think this is now finally the impetus to really roll up one's sleeves and do it. It's a somewhat complicated project, for a variety of reasons, not least that it's a language design problem, and designing a language is something I happen to have been doing for 40 years, designing our computational language. To my mind, language design is the single most concentrated, intellectually difficult thing that I know, and this is sort of a generalization of that. But I think ChatGPT has shown us, well, I didn't know how hard it was going to be; I'm now convinced it's doable, so to speak. So what

does this mean? Well, people might have said: okay, look, we've seen neural nets that do speech to text, we've seen neural nets do image identification, now we've seen neural nets that can write essays; surely if we have a big enough neural net it can do everything. Well, not neural nets of the kind we have so far, with the training structure they have so far. Not on their own: they will not be able to do these irreducible computations. Now, these irreducible computations are not easy for us humans either. When it comes to doing a piece of math, or worse, if somebody says "here's a program, run this program in your head," good luck: very few people can do that. There is a difference between what is immediate and easy for us humans and what is computationally possible.

Now, another question is: maybe we don't care about the things that aren't easy for humans. It's turned out that we built an awful lot of good technology over the last few centuries based on what amounts to a much deeper level. In our technology we're not actually going that far into irreducible computation, but we're going far enough that it's beyond what we humans can readily do, or what we can do with the kinds of neural nets that exist today. So I think that's the thing to understand.

There's a certain set of things here. What's happening in ChatGPT is that it's kind of taking the average of the web, plus books and so on, and it's saying "I'm going to fit things together based on that," and that's how it's writing its essays. And when it's deducing things, when it's doing logic and things like that, it's doing logic the way Aristotle discovered logic: it's figuring out "oh, there's a pattern of words that looks like this, and it tends to follow it like that, because that's what I've seen in a hundred thousand examples on the web."
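A minimal caricature of that "learning patterns from examples" idea: build a table of which word follows which in a tiny corpus, then continue text with the most commonly seen follower. ChatGPT is incomparably more sophisticated (it works over learned representations, not literal words), but the spirit of "continue with what typically follows" is the same.

```python
# Toy "statistics of words": a bigram table built from a tiny corpus,
# continued by always picking the most frequently seen next word.
from collections import Counter, defaultdict

corpus = ("all men are mortal socrates is a man therefore socrates is mortal "
          "all birds are animals a sparrow is a bird therefore a sparrow "
          "is an animal").split()

following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def continue_text(word, steps):
    out = [word]
    for _ in range(steps):
        candidates = following[out[-1]].most_common(1)
        if not candidates:
            break
        out.append(candidates[0][0])
    return " ".join(out)

print(continue_text("socrates", 2))
```

The continuation looks syllogism-shaped purely because syllogism-shaped text dominates the examples, which is exactly the Aristotle point.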

So that's kind of what it's doing, and that gives us some sense of what it's going to be able to do. I think the most important thing it's able to do is that it's a form of user interface. I might get something where I know what really matters in three bullet points, but if I'm going to communicate that to somebody else, they're really not going to understand my three bullet points; they need wrapping around them, they need something which is a whole essay describing the thing. That's the human interface, so to speak. It's just like you could have the raw bits or something, and that wouldn't be useful to us humans: we have to wrap things in a human-compatible way, and language is sort of our richest human-compatible medium. What ChatGPT is doing, I think the way to think about it, is providing this interface. It's generating pieces of language that are consistent, and if you feed it specific things that it will talk about, so to speak, then it's wrapping the specifics with this interface that corresponds to flowing human language.

All right, I went on much longer than I intended, and I see there are a bunch of questions here, so I'm going to try to address some of these. A question from Antipas: are constructed languages like Esperanto more amenable to the semantic grammar AI approach? Very good, very interesting question.

So, the one that I was experimenting with was the smallest of the constructed languages, a language called Toki Pona, that has only about 130 words in it. It is not a language that allows one to express everything one might want to express, but it's a good small-talk type language, a small language for doing small talk, so to speak, and it expresses a bunch of decent ideas. So yes, I was going to look at that; it's a good clue, again, to semantic grammar, that there are these small constructed languages. I also think that probably the largest of the constructed languages, Ithkuil, is another interesting source: it's a language which has tried to pull in all of the kinds of language structures from all known languages, in some first approximation. So I think the answer is yes, I think they're a good stimulus for thinking about semantic grammar. In a sense, when people were trying to do this back in the 1600s, they were very confused about many things, but they'd gone a long way, given that it was the 1600s. They were confused about things like whether the actual letters that were written, as you wrote the language, mattered, more so than the structure of things; but there was the beginning of that kind of idea.

Okay, I'm going to take these from the end, but I want to go back to some of the others too. Tori is asking: has anyone studied what's the best way of prompting ChatGPT, and could a semantic law of motion be helpful? Undoubtedly yes. I don't know the answer; I think it's a good question, and I don't really know.

Albert is asking: is the 4,000-token limit analogous to working memory, and would accessing larger memory mean increasing the token limit, or increasing such capabilities through reinforcement learning? Well, with the token limits that exist right now: if you want to have a coherent essay, and you want it to know what it was talking about back in the early part of the essay, you'd better have enough tokens being fed into the neural net every time it generates a new token. If it forgot what it was talking about 5,000 tokens ago, it may be saying totally silly things now, because it didn't know what was there before. So in some sense, I don't think it's like our short working memory; I think it's more like the way you ramble on. I ramble on a lot, talking about things, and I might have forgotten half an hour later that I already talked about something; I might be telling the same story again. I hope I don't do that, I don't think I do that too badly, but that's the kind of thing that happens with this token limit. Let me go back to some of the questions

that were asked earlier. Erin was asking, talking more about the tension between superintelligence and computational irreducibility: how far can LLM intelligence go? I think I talked a little bit about that already. This question, oh boy, this is kind of complicated. The universe, the world, is full of computational irreducibility.
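My favorite concrete example of that is the rule 30 cellular automaton: the update rule fits on one line, yet, as far as anyone knows, you cannot predict the pattern it makes without actually running every step. A minimal sketch (rule 30 written as left XOR (center OR right)):

```python
# Rule 30: a one-line update rule whose behavior, as far as we know,
# can only be found by running it step by step.

def rule30_step(cells):
    n = len(cells)
    # New cell = left XOR (center OR right), on a wrapped-around row.
    return [cells[(i - 1) % n] ^ (cells[i] | cells[(i + 1) % n])
            for i in range(n)]

row = [0] * 31
row[15] = 1  # start from a single black cell
for _ in range(12):
    print("".join(".#"[c] for c in row))
    row = rule30_step(row)
```

Running it prints the familiar irregular rule 30 triangle; knowing, say, the center cell at step one million seems to require doing the million steps.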

It's full of situations where we know the underlying rules, but we run them as a computation, and you can't shortcut the steps. What we've discovered from our Physics Project is that it looks like the very lowest level of spacetime works just that way. In fact, just earlier today I saw a lovely piece of work about doing practical simulation of spacetimes and things using those ideas, very much supporting, again, that it's really computationally irreducible at the lowest level. Just like in something like a gas, where the molecules are bouncing around in this computationally irreducible way, what we humans do is sample aspects of the universe that have enough reducibility that we can predict enough to go about our lives. We don't pay attention to all those individual gas molecules bouncing around; we only pay attention to the aggregate, the pressure of the gas or whatever else. We don't pay attention to all the atoms of space; we only pay attention to the fact that there's this thing we can think of as more or less continuous space. So our story has been a story of finding slices of reducibility, places where we can predict things about the universe. There's a lot about the universe we cannot predict, we don't know, and if our existence depended on those things, if we had not found these slices of reducibility, we wouldn't be able to have a coherent existence of the kind that we do.

So if you ask where you go with that: there's an infinite kind of web of pieces of computational reducibility, an infinite set of things to discover. We have discovered some of them, and as we advance in our science and our technology, we get to explore more of that web of reducibility. But that's really the issue.

Now, the problem is that the way we humans react to that is: we have ways to describe what we can describe. We have words that describe things that are common in our world: we have a word for a camera, we have a word for a chair, those kinds of things. We don't have words for things which have not yet been common in our world. And when we look at the innards of ChatGPT, it's got all kinds of stuff going on in it; maybe some of those things happen quite often, but we don't have words for them, we haven't yet found a way to describe them. When we look at the natural world, there are things we've seen repeatedly, and we have words to describe them; we've built up this descriptive layer for talking about things. But if we jump out to somewhere else in the universe of possible computations, there may be pieces of reducibility there, but we don't have words to describe those things; we only know about the things that are near us, so to speak. And gradually, as science advances, as everything advances, we get to expand the domain that we can talk about: we get to have more words, we get to be able to talk about more things. It's this gradual process of us, societally, in a sense, learning more concepts: we can exchange concepts, we can build on those concepts, and so on. But if you throw us out into some other place in what I call the ruliad, the space of all possible computational processes, if you throw us out into an arbitrary place there, we will be completely confused. There will be things where we can tell there are actual computations going on, there are things happening, there are even pieces of reducibility, but we don't relate to those things. It's kind of like imagining that you're here now, and you're cryonically frozen for 500 years, and you wake up again, and there are all these other things in the world, and it's hard to reorient to all those other things without having seen the intermediate steps.

And when you talk about where you can go from what we have now, how you can sort of add more: intelligence is basically all about these pieces of reducibility, these ways to jump ahead; what we think of as human-like intelligence is about those kinds of things. So what's the vision of what will happen when the world is full of AIs? It's sort of interesting, because actually we've seen it before. When the world is full of AIs, and they're doing all these things, and there's all this computational irreducibility, there are all these pockets of reducibility that we don't have access to, because we haven't incrementally gotten to that point. What's going to be happening is that there's all this stuff happening among the AIs, in a layer that we don't understand. It's already happening in plenty of places on the web, in bidding for ads or showing you content or whatever: there's a layer of AI happening that we don't understand particularly well. We have a very clear model for that, which is nature. Nature is full of things going on that are often computationally irreducible, that we don't understand. What we've been able to do is carve out an existence, so to speak, that is coherent for us, even though there's all this computational irreducibility going on; we've got these little niches with respect to nature which are convenient for us as humans. And I think it's the same thing with the AI world: as it becomes like the natural world, and becomes not immediately comprehensible to us, our view of it has to be "oh, that's just the operation of nature, that's just something I'm not going to understand; oh, that's just the operation of the AIs, I'm not going to understand that; but here's this piece that we've actually managed to humanize, that we can understand." So that's a little bit of the thought about how that develops. In other words, you can say "I'm going to throw you out to some random place in the ruliad; there are incredible computations happening there." It's like, great, that's nice, I've spent a bunch of my life studying those kinds of things; but pulling them back, reeling them back into something which has direct human understandability, is a difficult thing.

Aaron is asking more of a business question

about Google and the transformer architecture. You know, it's been a very interesting thing: neural nets were this small, very fragmented field for many, many years, and then suddenly things started to work in 2012, and a lot of what worked, and what was really worked on, was done in a small number of large tech companies, and some not-so-large tech companies. It's a different picture of where innovation happens than has existed in other fields, and it's potentially a model for what will happen in other places. But it's always complicated what causes one group to do this and another group to do that: there are the entrepreneurial folks, who are smaller and more agile, and there are the folks who have more resources, and so on. It's always complicated.

Nicola is asking: do you think pre-training a large

biologically inspired language model might be feasible in the future? I don't know. The thing is, we don't know what parts of the biology are important. One incredibly important thing we just learned is that probably there's not much more to brains, that really matters for their information processing, than the neurons and their connections and so on. It could have been the case that every molecule has some quantum process going on, and that's where thinking really happens; but it doesn't seem to be the case, because this pinnacle of our thinking powers, being able to write long essays and so on, seems like it can be done with just a bunch of neurons with weights. Now, which other parts of biology are important? Actually, Terry Sejnowski just wrote a paper talking about how there are more backwards-going neural connections in brains than forwards-going ones. So in that sense, it looks like maybe we missed the point with these feedforward networks, which is basically what something like ChatGPT is, and that the feedback is really important; but we haven't yet really got the right idealized model of that. I do think the question of what's the next McCulloch-Pitts-type thing, the next simple metamodel of all this, is important. I also think there's probably a bunch of essential mathematical structure to learn about neural nets in general. You know, I was interested in neural nets back around 1980, and I was trying to simplify, simplify, simplify models of things, and I went past neural nets because they weren't simple enough for me: they had all these different weights and all these different network architectures and so on. I ended up studying cellular automata, and generalizations of those, where everything is much simpler: there are no real numbers, no arbitrary connections, no this, that and the other. But what matters and what doesn't, we just don't know yet.

Paul is asking: what about a five-senses multimodal model, to actually ground the system in the real world with real human-like experience? I think that will be important, and it will no doubt happen, and it will be more human-like. Look, ChatGPT is pretty human-like when it comes to text, because, by golly, it just read a large fraction of the text that we humans, at least publicly, wrote. But it hasn't had the experience of walking upstairs and doing this or that thing, and so it's not going to be very human-like when it comes to those sorts of things. If it has those experiences, then I think that will be interesting.

Someone's commenting that I

should do the same kind of description for image generation, generative AI for images. The thing that I like to think about there is that I think that's one of our first moments of communication with an alien intelligence. In other words, in some sense we're talking to the generative AI in English words or whatever, and it's going into its alien mind, so to speak, and plucking out the stuff that becomes these images. It's less so with ChatGPT, where the output is something that is already intended to be very human: it's human language. With an image generation system, it's producing something which has to be somewhat recognizable to us, not a random bunch of pixels, something that resonates with things we know; but in a sense it can be more completely creative in what it's showing us. As one tries to navigate around the space of what it's going to show us, it feels a lot like you're communicating with an alien intelligence, and it's showing you things about how it thinks, by saying "oh, you said those words, I'm going to do this," and so on.

I have to say, the other examples of alien intelligences that we have all around the planet are lots of critters, from the cetaceans on, so to speak. And I have to believe that if we could correlate the experiences of those critters, cats, dogs, cockatoos, whatever else, with the vocalizations that they have and so on, then it's talk-to-the-animals time, so to speak. Given the kinds of things we've learned from ChatGPT about the structure of human language, I am quite certain that if there's any linguistic structure for other animals, it'll be similar, because one of the lessons of biology is that there are fewer ideas than you think: the things that we have have precursors in biology from long, long ago. We may have made innovations in language, it's kind of the key innovation of our species, but whatever is there had precursors in other organisms. And the fact that we now have this much better way of teasing out a model for language in humans means we should be able to do that elsewhere as well.

David is saying ChatGPT's developers

seem committed to injecting sort of political curtailments into the code, to avoid it talking about controversial topics: how is that done? It's done through this reinforcement learning stage. I think maybe there are also some actual "if it's starting to use these words, just stop it" type things; I think maybe that's being done a little more with Bing than it is with ChatGPT at this point. I have to say, so far as I know, ChatGPT is a G-rated thing, and that's an achievement in its own right; maybe I shouldn't say that, because probably there are horrible counterexamples.

One of the things that happens is: well, you have a bunch of humans, and they are giving it this training, and those humans have opinions. There'll be this kind of politics or that kind of politics, or they'll believe in this or that or the other, and, whether purposefully or not, they're going to impose those opinions. Because an opinion is what you're expressing when you tell ChatGPT "that essay is good, that essay isn't good"; at some level that's an opinion. Now, that opinion may or may not be colored into something about politics or the like, but it's sort of inevitable that you have it. Something I've thought about a little, in connection with general AI injection into the things we see in the world, like social media content and so on: I tend to think the right way to solve this is to say, okay, let's have multiple chatbots or whatever, trained in effect with different criteria, by different groups, under different banners, so to speak; and you get to pick the banner of chatbot that you want to be using, and then you're happy, because you're not seeing things that horrify you, and so on. And you can discuss whether you want to pick the chatbot that accepts the most diverse views or not; that throws one back into kind of standard issues of political philosophy, and things like this.

I think the thing to realize is that one wants to put ethics somehow into what's going on, but when one says "let's have the AIs do the ethics," that's hopeless. There is no mathematically definable perfect ethics: ethics is the way humans want things to be, and then you have to choose. Is it the average ethics? Is it the ethics which makes only five percent of the people unhappy? Is it this, that and the other? These are old questions of political philosophy that, so far as we know, don't really have good answers. Once thrown into those questions, there's no "oh, we'll get a machine to do it and it'll be perfect." It won't happen, because these are questions that aren't solvable by a machine; they're questions that, in a sense, come right from us. The thing to realize about ChatGPT in general is that ChatGPT is a mirror on us: it's taken what we wrote on the web, in aggregate, and it's reflecting that back to us. So insofar as it does goofy things and says goofy things, that's really on us; it's the average web that we're seeing here.

Tenacious is asking about a particular paper, which sounds interesting, but I don't know about it. Let's see what's up next here.

Tragath is wondering how neural net AI compares to other living multicellular intelligence: plant roots, nerve nets in things like jellyfish, biofilms, and so on. Well, okay. One of the big things that's come out of a bunch of science that I've done is this thing I call the Principle of Computational Equivalence, which essentially says that as soon as you have a system that is not computationally trivial, it will ultimately be equivalent in its computational capabilities. That's an important thing when you talk about computational irreducibility, because computational irreducibility arises because, once you've got a system doing its computation, all other systems will just be equivalent in their computational sophistication: you can't expect a super-system that's going to jump ahead and say "oh, you went through all these computational steps, but I can jump ahead and just get to the answer." Now, a really good question is this:

One of the things that is characteristic of our consciousness, relative to all the computational irreducibility in the universe, is the fact that we have coherent consciousness, and it seems to me that is a consequence of the fact that we are two things. Point one: we are computationally bounded; we're not capable of looking at all those molecules bouncing around, we only see various aggregate effects. Point two: we believe that we are persistent in time, that we have a persistent thread of existence through time. It turns out, and this is the big fact of the last few years for me, that the big theories of physics (general relativity, the theory of gravity; quantum mechanics; and statistical mechanics, the Second Law of thermodynamics, the law of entropy increase), all three of those, can be derived from knowing that we human observers are noticing those laws, and that we human observers have those two characteristics I just mentioned. I consider this a very important, beautiful, profound result: that we observe the physics we observe because we are observers of the kind that we are.

Now, an interesting question, I suppose, is this: we are computationally limited things, and the very fact that we observe physics the way we do is a consequence of those computational limitations. So how similar are the computational limitations in these other kinds of systems? In a sense, the fungus as observer, so to speak: how similar is that kind of observer to a human observer, in terms of what computational capabilities it has and so on? My guess is it's pretty similar. In fact, one of my next projects is a thing I'm calling observer theory, which is kind of a general theory of the kinds of observers that you can have of things; so maybe we'll learn something from that. But it's a very interesting question.

Dugan is

commenting: ChatGPT could be improved using an automated fact-checking system, like an adversarial network; for instance, could one basically train ChatGPT with Wolfram|Alpha and have it get better? The answer is: surely, up to a point, but then it will lose it, just like it does with parentheses. With a network of that architecture, there's a certain set of things one can learn, but one cannot learn what is computationally irreducible. In other words, you can learn the common cases, but there'll always be surprises, there'll always be unexpected things that you can only get to by explicitly doing those computations.

Bob is asking: can ChatGPT play a text-based adventure game? I bet it can. I don't know, I haven't seen anybody try that, but I bet it can.

okay there’s a question here from software uh aside from being trained on a huge corpus what

is it about gpt3 that makes it so good at language i think i tried to talk about that a bit

about the fact that we it’s it’s um uh that there’s you know there’s regularity in language i

think the particulars of the transformer architecture this kind of looking back on

sequences and things that’s been helpful in refining the way that you can train it

and that seems to be important let’s see
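That "looking back on sequences" is the attention operation at the heart of the transformer: each position scores a query against earlier positions' keys and takes a weighted average of their values. A bare-bones numeric sketch with toy dimensions, nothing like gpt-3's actual weights:

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(query, keys, values):
    """One attention step: score the query against each earlier
    key, softmax the scores, then average the values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Three earlier positions in a sequence, each with a 2-d key and value.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
query = [1.0, 0.0]  # a query that "looks like" the first key

out = attend(query, keys, values)
print(out)  # pulled toward values whose keys match the query
```

The output is dominated by the values at positions whose keys align with the query, which is the "looking back" that lets each new token use the relevant earlier context.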

atoria is asking could feature impact scores help us understand gpt better

uh well so what that’s about is when you run a neural net you can say

sort of how much did some particular feature affect the output that

the neural net gave chat gpt is just a really pretty complicated thing i mean i started digging

around trying to understand it sort of as a natural scientist you know i couldn’t do

sort of neuroscience with actual brains because i’m a hundred or a thousand times too squeamish

for that but you know i can dig around inside an artificial brain and i started trying to do that

and it’s difficult i mean i didn’t look at feature impact scores i think one could

um okay but by the way i’m amused by these questions because

i can kind of you know i can still tell you guys are not bots i think

uh let’s see
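For concreteness, the simplest kind of feature impact score just perturbs one input at a time and measures how much the output moves. Here it is on a stand-in two-line "model", not on chat gpt itself:

```python
# Occlusion-style feature impact on a toy model: zero out each
# input feature in turn and see how much the output changes.

def model(x):
    # A stand-in "network": output depends strongly on x[0],
    # weakly on x[1], and not at all on x[2].
    return 3.0 * x[0] + 0.5 * x[1] + 0.0 * x[2]

def feature_impacts(f, x):
    """Impact of feature i = |f(x with feature i zeroed) - f(x)|."""
    base = f(x)
    impacts = []
    for i in range(len(x)):
        occluded = list(x)
        occluded[i] = 0.0          # "occlude" feature i
        impacts.append(abs(f(occluded) - base))
    return impacts

print(feature_impacts(model, [1.0, 1.0, 1.0]))  # -> [3.0, 0.5, 0.0]
```

On a trained network the same probe ranks which parts of the input mattered; the hard part, as the answer above suggests, is that in something as large as chat gpt there are an enormous number of features to probe and interpret.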

um ron is asking about implications like i have to work late tonight what does that mean um yeah

absolutely chat gpt is learning stuff like that because it’s seen you know a bunch of

text that says i have to work late tonight so i can’t do this it’s seen examples of that

it’s kind of doing the aristotle thing again it’s just seeing you know these patterns of language

and that’s what it’s learning from so to speak um so yes these things we might say how do we think

about that formally oh it seems kind of complicated to us but that pattern of

language has occurred before all right last thing perhaps
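That "it's seen these patterns before" idea can be caricatured with a bigram model: count which word follows which in some text, then continue a prompt with the most frequently seen continuation. A toy sketch, vastly simpler than what chat gpt actually does:

```python
from collections import Counter, defaultdict

def train_bigrams(corpus):
    """Count, for each word, which words follow it and how often."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for a, b in zip(words, words[1:]):
            follows[a][b] += 1
    return follows

def continue_text(follows, prompt, n=3):
    """Extend the prompt n words, always picking the most common
    continuation of the last word seen in training."""
    words = prompt.split()
    for _ in range(n):
        nxt = follows.get(words[-1])
        if not nxt:
            break                      # never seen this word; stop
        words.append(nxt.most_common(1)[0][0])
    return " ".join(words)

# A tiny stand-in corpus of "i have to work late" sentences.
corpus = [
    "i have to work late tonight so i cannot come",
    "i have to work late so i cannot make it",
]
follows = train_bigrams(corpus)
print(continue_text(follows, "i have to work", n=1))  # -> i have to work late
```

Real language models condition on much longer contexts than one word, but the principle is the same: the "implication" is just a continuation that has occurred in the training text.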

uh okay albert is asking do you think humans learn efficiently because they’re born with

the right networks to learn language more easily or is there some difference i think

the architecture of the brain undoubtedly is important i mean you know

my impression is that you know it’s a matter for the neuroscientists to go and

find out now that we know that certain things can be made to work with artificial neural nets

did the actual brain discover those things too and the answer will be often yes i mean just like

there are things we probably have learned from you know the flight of drones or the flight of planes

that we can go back and say oh did biology actually already have that idea i think that

there are undoubtedly features of human language which depend on aspects of the brain i

mean like for example you know talking to terry we’re talking about

the um the loop between the basal ganglia and the cortex and the possibility that you know the outer

loop of chat gpt is a little bit like that loop and it’s kind of like i’m turning things

over in my mind one might say maybe that’s actually a loop of data going around this

literal loop from one part of the brain to another maybe maybe not but sometimes those

those sayings have a habit of being more true than you think and maybe the reason that when we

think about things we have these certain time frames when you think about things there’s certain

times between when words come out and so on maybe those times are literally associated with the

amount of time it takes for signals to propagate through some number of layers

in our brains and i think in that sense if that’s the case there will be features of

language which are yes we’ve got this brain architecture we’re going to have these

features of language and insofar as language evolves insofar as it’s adaptively

worthwhile to have a different form of language that is optimized by having some different form

of brain structure that’s what will have been driven by natural selection and so on i mean

i think you know there are aspects of language like we know we tend to remember

chunks of about five things at a time so to speak and we know that if we try and

give a sentence which has deeper and deeper sub clauses we lose it

past some point and that’s presumably a hardware limitation of our brains okay dave is asking

this is a good last question how difficult will it be for individuals to train something like a

personal chat gpt that learns to behave more and more like a clone of the user i don’t

know i’m going to try it i have a lot of training data as i mentioned you know 50 million

typed words for example for me and my guess is i mean i know

somebody tried to train an earlier gpt3 on stuff of mine i didn’t think it

was terribly good when i read ones trained for other people i thought they were pretty decent

when i looked at one trained for myself because i kind of know myself better than i know

anybody else you know it didn’t ring true so to speak but i

do think that will be a thing you know being able to write emails like i write emails it’ll

do a decent job of that i suspect you know i would like to believe that one still

as a human one still has an edge because in a sense one knows what the goals are

you know this system’s goal is to complete english text and you know the bigger picture of

what’s going on is not going to be part of what it has except insofar as it learns the aggregate

bigger picture from just reading lots of text so you know but i do think it’ll be

interesting i expect that you know as a person who gets a lot of email some of which is

fairly easy to answer in principle maybe my bot will be able to answer the easiest

stuff for me all right that’s probably a good place to wrap this up thanks for joining me

and i would like to say that for those interested in more technical details some

of the folks in our machine learning group are going to be doing some more detailed technical

webinars about this material and really going into you know how

you build these things from scratch and so on and some more detail about

what’s actually happening but i should wrap up here for now so thanks again

and bye for now