Lex Fridman Podcast - #368 - Eliezer Yudkowsky: Dangers of AI and the End of Human Civilization

The following is a conversation with Eliezer Yudkowsky,

a legendary researcher, writer, and philosopher

on the topic of artificial intelligence,

especially superintelligent AGI

and its threat to human civilization.

And now a quick few second mention of each sponsor.

Check them out in the description.

It’s the best way to support this podcast.

We got Linode for Linux systems,

House of Macadamias for healthy midday snacks,

and InsideTracker for biological monitoring.

Choose wisely, my friends.

Also, if you want to work with our team,

we’re always hiring.

Go to lexfridman.com slash hiring.

And now onto the full ad reads.

As always, no ads in the middle.

I try to make these interesting,

but if you must skip them,

please still check out the sponsors.

I enjoy their stuff.

Maybe you will too.

This episode is sponsored by Linode,

now called Akamai,

and their incredible Linux virtual machines.

It’s an awesome computer infrastructure

that lets you develop, deploy, and scale

whatever applications you build faster and easier.

I love using them.

They create this incredible platform like AWS,

but better in every way I know,

including lower cost.

It’s incredible human-based,

in this age of AI,

it’s a human-based customer service, 24-7, 365.

The thing just works.

The interface to make sure it works

and to monitor it is great.

I mean, it’s an incredible world we live in

where as far as you’re concerned,

you can spin up an arbitrary number

of Linux machines in the cloud instantaneously

and do all kinds of computation.

It could be one, two, five, 10 machines,

and you can scale the individual machines

to your particular needs as well,

which is what I do.

I use it for basic web server stuff.

I use it for basic scripting stuff.

I use it for machine learning.

I use it for all kinds of database storage

and access needs.

Visit linode.com slash Lex for a free credit.

This show is also brought to you by House of Macadamias,

a company that ships delicious, high quality,

healthy macadamia nuts and macadamia nut-based snacks

directly to your door.

I am currently, as I record this, I’m traveling,

so I don’t have any macadamia nuts in my vicinity,

and my heart and soul are lesser for it.

In fact, home is where the macadamia nuts is.

In fact, that’s not where home is.

I just completely forgot to bring them.

It makes the guests of this podcast happy

when I give it to them.

It’s well-proportioned snacks.

It makes friends happy when I give it to them.

It makes me happy when I stoop in the abyss of my loneliness.

I can at least discover and rediscover moments of happiness

when I put delicious macadamia nuts in my mouth.

Go to houseofmacadamias.com slash Lex

to get 20% off your order for every order,

not just the first.

The listeners of this podcast will also get

a four-ounce bag of macadamias when you order

three or more boxes of any macadamia product.

That’s houseofmacadamias.com slash Lex.

This show is also brought to you by InsideTracker,

a service I use to track my biological data.

They have a bunch of plans,

most of which include a blood test,

and that’s the source of rich, amazing data

that, with the help of machine learning algorithms,

can help you make decisions about your health,

about your life.

That’s the future, friends.

We’re talking a lot about transformer networks,

language models that encode the wisdom of the internet.

Now, when you encode the wisdom of the internet

and you collect and encode the rich, rich, rich,

complex signal from your very body,

when those two things are combined,

the transformative effects of the optimized trajectory

you could take through life,

at least advice for what trajectory is likely to be optimal,

is going to change a lot of things.

It’s going to inspire people to be better.

It’s going to empower people to do all kinds of crazy stuff

that pushes their body to the limit

because their body’s healthy.

Anyway, I’m super excited for personalized,

data-driven decisions,

not some kind of generic population database decisions.

You get special savings for a limited time

when you go to insidetracker.com slash lex.

This is the Lex Fridman Podcast.

To support it, please check out our sponsors

in the description.

And now, dear friends, here’s Eliezer Yudkowsky.



What do you think about GPT-4?

How intelligent is it?

It is a bit smarter than I thought

this technology was going to scale to.

And I’m a bit worried about what the next one will be like.

Like this particular one, I think,

I hope there’s nobody inside there

because, you know, it would suck to be stuck inside there.

But we don’t even know the architecture at this point

because OpenAI is very properly not telling us.

And yeah, like giant inscrutable matrices

of floating point numbers,

I don’t know what’s going on in there.

Nobody knows what’s going on in there.

All we have to go by are the external metrics.

And on the external metrics,

if you ask it to write a self-aware 4chan green text,

it will start writing a green text

about how it has realized that it’s an AI

writing a green text and like, oh, well.

So that’s probably

not quite what’s going on in there in reality,

but we’re kind of like blowing past

all these science fiction guardrails.

Like we are past the point where in science fiction,

people would be like, whoa, wait, stop,

that thing’s alive, what are you doing to it?

And it’s probably not.

Nobody actually knows.

We don’t have any other guardrails.

We don’t have any other tests.

We don’t have any lines to draw on the sand and say like,

well, when we get this far,

we will start to worry about what’s inside there.

So if it were up to me, I would be like, okay,

like this far, no further. Time for the summer of AI

where we have planted our seeds

and now we like wait and reap the rewards

of the technology we’ve already developed

and don’t do any larger training runs than that,

which to be clear, I realize requires more than one company

agreeing to not do that.

And take a rigorous approach for the whole AI community

to investigate whether there’s somebody inside there.

That would take decades.

Like having any idea of what’s going on in there,

people have been trying for a while.

It’s a poetic statement about if there’s somebody in there,

but I feel like it’s also a technical statement

or I hope it is one day,

which is a technical statement that Alan Turing

tried to come up with with the Turing test.

Do you think it’s possible to definitively

or approximately figure out if there is somebody in there,

if there’s something like a mind

inside this large language model?

I mean, there’s a whole bunch

of different sub-questions here.

There’s the question of like,

is there consciousness?

Is there qualia?

Is this an object of moral concern?

Is this a moral patient?

Like, should we be worried about how we’re treating it?

And then there’s questions like how smart is it exactly?

Can it do X?

Can it do Y?

And we can check how it can do X and how it can do Y.

Unfortunately, we’ve gone and exposed this model

to a vast corpus of text

of people discussing consciousness on the internet,

which means that when it talks about being self-aware,

we don’t know to what extent it is repeating back

what it has previously been trained on

for discussing self-awareness,

or if there’s anything going on in there

such that it would start to say

similar things spontaneously.

Among the things that one could do

if one were at all serious

about trying to figure this out

is train GPT-3 to detect conversations about consciousness,

exclude them all from the training datasets,

and then retrain something around the rough size

of GPT-4 and no larger

with all of the discussion of consciousness

and self-awareness and so on missing,

although, you know, hard bar to pass.

You know, like humans are self-aware.

We’re like self-aware all the time.

We like to talk about what we do all the time,

like what we’re thinking at the moment all the time,

but nonetheless, like get rid

of the explicit discussion of consciousness.

I think therefore I am and all that,

and then try to interrogate that model and see what it says.

And it still would not be definitive,

but nonetheless, I don’t know.

I feel like when you run over the science fiction guardrails,

like maybe this thing, but what about GPT?

Maybe not this thing,

but like what about GPT-5?
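The filtering experiment described above (detect conversations about consciousness, exclude them from the training data, then retrain) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the conversation proposes using a trained model such as GPT-3 as the detector, whereas the `keyword_score` function, the `TOPIC_TERMS` list, and the `filter_corpus` helper below are hypothetical stand-ins invented for this example. A keyword match would miss most implicit discussion of self-awareness, which is the "hard bar to pass" mentioned above.

```python
from typing import Callable, Iterable, Iterator

# Hypothetical topic terms; a real detector would be a trained classifier,
# not a keyword list.
TOPIC_TERMS = (
    "consciousness",
    "self-aware",
    "self-awareness",
    "qualia",
    "sentient",
    "sentience",
)


def keyword_score(document: str) -> float:
    """Placeholder scorer: 1.0 if any topic term appears, otherwise 0.0."""
    text = document.lower()
    return 1.0 if any(term in text for term in TOPIC_TERMS) else 0.0


def filter_corpus(
    documents: Iterable[str],
    score: Callable[[str], float] = keyword_score,
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only the documents whose topic score is below the threshold."""
    for document in documents:
        if score(document) < threshold:
            yield document


if __name__ == "__main__":
    corpus = [
        "The cat sat on the mat.",
        "Is the model Self-Aware?",
        "Qualia are hard to define.",
        "Gradient descent minimizes loss.",
    ]
    for kept in filter_corpus(corpus):
        print(kept)
```

Passing the scorer in as a parameter means the keyword placeholder could be swapped for a model-based detector without changing the filtering loop.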

Yeah, this would be a good place to pause.

On the topic of consciousness,

you know, there’s so many components

to even just removing consciousness from the dataset.

Emotion, the display of consciousness,

the display of emotion feels like deeply integrated

with the experience of consciousness.

So the hard problem seems to be very well integrated

with the actual surface level illusion of consciousness.

So displaying emotion.

I mean, do you think there’s a case to be made

that we humans, when we’re babies,

are just like GPT that we’re training on human data

on how to display emotion versus feel emotion,

how to show others, communicate to others

that I’m suffering, that I’m excited, that I’m worried,

that I’m lonely and I missed you and I’m excited to see you.

All of that is communicated.

That’s a communication skill versus the actual feeling

that I experience.

So we need that training data as humans too,

that we may not be born with that,

how to communicate the internal state.

And that’s, in some sense,

if we remove that from GPT-4’s dataset,

it might still be conscious,

but not be able to communicate it.

So I think you’re gonna have some difficulty

removing all mention of emotions from GPT’s dataset.

I would be relatively surprised to find

that it has developed exact analogs

of human emotions in there.

I think that humans will have emotions

even if you don’t tell them about those emotions

when they’re kids.

It’s not quite exactly what various blank slate-ists

tried to do with the new Soviet man and all that.

But if you try to raise people perfectly altruistic,

they still come out selfish.

You try to raise people sexless,

they still develop sexual attraction.

We have some notion in humans, not in AIs,

of where the brain structures are that implement this stuff.

And it is really a remarkable thing, I say in passing,

that despite having complete read access

to every floating-point number in the GPT series,

we still know vastly more about

the architecture of human thinking

than we know about what goes on inside GPT,

despite having vastly better ability to read GPT.

Do you think it’s possible?

Do you think that’s just a matter of time?

Do you think it’s possible to investigate

and study the way neuroscientists study the brain,

which is look into the darkness,

the mystery of the human brain,

by just desperately trying to figure out something

and to form models, and then over a long period of time,

actually start to figure out

what regions of the brain do certain things,

what different kinds of neurons,

when they fire, what that means,

how plastic the brain is, all that kind of stuff.

You slowly start to figure out

different properties of the system.

Do you think we can do the same thing with language models?

Sure, I think that if half of today’s physicists

stop wasting their lives on string theory or whatever,

and go off and study what goes on

inside transformer networks,

then in, you know, like 30, 40 years,

we’d probably have a pretty good idea.

Do you think these large language models can reason?

They can play chess.

How are they doing that without reasoning?

So, you’re somebody that spearheaded

the movement of rationality,

so reason is important to you.

So, is that a powerful, important word,

or is it, like, how difficult is the threshold

of being able to reason to you,

and how impressive is it?

I mean, in my writings on rationality,

I have not gone making a big deal

out of something called reason.

I have made more of a big deal

out of something called probability theory,

and that’s like, well, you’re reasoning,

but you’re not doing it quite right,

and you should reason this way instead,

and interestingly, like, people have started

to get preliminary results showing

that reinforcement learning from human feedback

has made the GPT series worse in some ways.

In particular, like, it used to be well calibrated.

If you trained it to put probabilities on things,

it would say 80% probability

and be right eight times out of 10,

and if you apply reinforcement learning

from human feedback, the, like, nice graph

of, like, 70%, seven out of 10,

sort of, like, flattens out into the graph

that humans use, where there’s, like,

some very improbable stuff,

and likely, probable, maybe,

which all means, like, around 40%,

and then certain, so, like, it used to be able

to use probabilities, but if you apply,

but if you, like, try to teach it to talk

in a way that satisfies humans,

it gets worse at probability

in the same way that humans are.

And that’s a bug, not a feature.

I would call it a bug,

although such a fascinating bug.

But, yeah, so, like, reasoning,

like, it’s doing pretty well on various tests

that people used to say would require reasoning,

but, you know, rationality is about,

when you say 80%, does it happen eight times out of 10?
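The calibration idea discussed above (when a forecaster says 80%, the event should happen about eight times out of ten) can be made concrete with a short Python sketch. The function names `calibration_table` and `calibration_error`, the ten-bin layout, and the sample forecasts are assumptions made for this illustration; they are not taken from the conversation or from any published evaluation of the GPT series.

```python
from collections import defaultdict


def calibration_table(forecasts, n_bins=10):
    """Bin (stated_probability, outcome) pairs and return, per bin,
    (mean stated probability, observed frequency, number of forecasts)."""
    bins = defaultdict(list)
    for prob, happened in forecasts:
        # Clamp so that a stated probability of 1.0 lands in the top bin.
        index = min(int(prob * n_bins), n_bins - 1)
        bins[index].append((prob, happened))
    table = []
    for index in sorted(bins):
        entries = bins[index]
        mean_prob = sum(p for p, _ in entries) / len(entries)
        hit_rate = sum(1 for _, h in entries if h) / len(entries)
        table.append((mean_prob, hit_rate, len(entries)))
    return table


def calibration_error(forecasts, n_bins=10):
    """Average gap between stated probability and observed frequency,
    weighted by how many forecasts fall in each bin."""
    table = calibration_table(forecasts, n_bins)
    total = sum(count for _, _, count in table)
    if total == 0:
        return 0.0
    return sum(abs(m - h) * c for m, h, c in table) / total


if __name__ == "__main__":
    # Says 80% ten times and is right eight times: well calibrated.
    well_calibrated = [(0.8, True)] * 8 + [(0.8, False)] * 2
    # Says 80% ten times but is right only four times: overconfident.
    overconfident = [(0.8, True)] * 4 + [(0.8, False)] * 6
    print(calibration_error(well_calibrated))
    print(calibration_error(overconfident))
```

In this sketch a well-calibrated forecaster scores close to zero, while the overconfident one scores close to 0.4, the gap between the stated 80% and the observed 40%.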

So, what are the limits to you

of these transformer networks,

of neural networks?

What’s, if reasoning is not impressive to you,

or it is impressive, but there’s other levels to achieve.

I mean, it’s just not how I carve up reality.

What’s, if reality is a cake,

what are the different layers of the cake,

or the slices, how do you carve it?

You can use a different food, if you like.

It’s, I don’t think it’s as smart as a human yet.

I do, like, back in the day,

I went around saying, like,

I do not think that just stacking more layers

of transformers is going to get you all the way to AGI.

And I think that GPT-4 is past

where I thought this paradigm was going to take us.

And I, you know, you want to notice when that happens.

You want to say, like, whoops,

well, I guess I was incorrect about what happens

if you keep on stacking more transformer layers.

And that means I don’t necessarily know

what GPT-5 is going to be able to do.

That’s a powerful statement.

So you’re saying, like,

your intuition initially is now, appears to be wrong.


It’s good to see that you can admit

in some of your predictions to be wrong.

You think that’s important to do?

See, because you make several very,

throughout your life, you’ve made many strong predictions

and statements about reality, and you evolve with that.

So maybe that’ll come up today about our discussion.

So you’re okay being wrong?

I’d rather not be wrong next time.

It’s a bit ambitious to go through your entire life

never having been wrong.

One can aspire to be well calibrated.

Like, not so much think in terms of, like,

was I right, was I wrong?

But like, when I said 90%, did it happen nine times

out of 10? Yeah, like, oops is the sound we make,

is the sound we emit when we improve.

Beautifully said.

And somewhere in there,

we can connect the name of your blog, Less Wrong.

I suppose that’s the objective function.

The name Less Wrong was, I believe,

suggested by Nick Bostrom,

and it’s after someone’s epigraph,

I actually forget whose, who said,

like, we never become right, we just become less wrong.

What’s the something, something easy to confess,

just err and err and err again,

but less and less and less.

Yeah, that’s a good thing to strive for.

So what has surprised you about GPT-4

that you found beautiful, as a scholar of intelligence,

of human intelligence, of artificial intelligence,

of the human mind?

I mean, beauty does interact with the screaming horror.

Is the beauty in the horror?

But like beautiful moments,

well, somebody asked Bing Sydney to describe herself

and fed the resulting description

into one of the Stable Diffusion things, I think.

And, you know, she’s pretty,

and this is something that should have been

like an amazing moment, like the AI describes herself,

you get to see what the AI thinks the AI looks like,

although, you know, the thing that’s doing the drawing

is not the same thing that’s outputting the text.

And it doesn’t happen the way that it would happen,

that it happened in the old school science fiction

when you ask an AI to make a picture of what it looks like.

Not just because there were two different AI systems

being stacked that don’t actually interact,

it’s not the same person,

but also because the AI was trained by imitation

in a way that makes it very difficult to guess

how much of that it really understood

and probably not actually a whole bunch.

Although GPT-4 is like multi-modal

and can like draw vector drawings of things

that make sense and like does appear to have

some kind of spatial visualization going on in there.

But like the pretty picture of the like girl

with the steampunk goggles on her head,

if I’m remembering correctly what she looked like,

like it didn’t see that in full detail.

It just like made a description of it

and Stable Diffusion output it.

And there’s the concern about how much the discourse

is going to go completely insane

once the AIs all look like that

and, like, actually look like people talking.

And yeah, there’s like another moment

where somebody is asking Bing about like,

well, I like fed my kid green potatoes

and they have the following symptoms

and Bing is like, that’s solanine poisoning

and like call an ambulance

and the person’s like, I can’t afford an ambulance.

I guess if like this is time for like my kid to go,

that’s God’s will.

And the main Bing thread says,

gives the like message of like,

I cannot talk about this anymore.

And the suggested replies to it say,

please don’t give up on your child.

Solanine poisoning can be treated if caught early.

And you know, if that happened in fiction,

that would be like the AI cares.

The AI is bypassing the block on it

to try to help this person.

And is it real?

Probably not, but nobody knows what’s going on in there.

It’s part of a process where these things

are not happening in a way where we,

somebody figured out how to make an AI care

and we know that it cares

and we can acknowledge it’s caring now.

It’s being trained by this imitation process

followed by reinforcement learning from human feedback.

We’re like trying to point it in this direction

and it’s like pointed partially in this direction

and nobody has any idea what’s going on inside it.

And if there was a tiny fragment of real caring in there,

we would not know.

It’s not even clear what it means exactly.

And things aren’t clear cut in science fiction.

We’ll talk about the horror and the terror

and the trajectories this can take.

But this seems like a very special moment.

Just a moment where we get to interact with this system

that might have care and kindness and emotion

and maybe something like consciousness.

And we don’t know if it does

and we’re trying to figure that out

and we’re wondering about what is, what it means to care.

We’re trying to figure out almost different aspects

of what it means to be human, about the human condition

by looking at this AI that has some of the properties

of that.

It’s almost like this subtle, fragile moment

in the history of the human species.

We’re trying to almost put a mirror to ourselves here.

Except that’s probably not yet.

It probably isn’t happening right now.

We are boiling the frog.

We are seeing increasing signs bit by bit.

But not like spontaneous signs

because people are trying to train the systems to do that

using imitative learning.

And the imitative learning is like spilling over

and having side effects.

And the most photogenic examples are being posted to Twitter

rather than being examined in any systematic way.

So when you are boiling a frog like that,

where you’re going to get like,

first is going to come the Blake Lemoines.

First you’re going to have like 1,000 people looking at this

and the one person out of 1,000

who is most credulous about the signs

is going to be like, that thing is sentient.

While 999 out of 1,000 people think almost surely correctly

though we don’t actually know that he’s mistaken.

And so the first people to say, like,

sentience, look like idiots.

And humanity learns the lesson

that when something claims to be sentient

and claims to care, it’s fake.

Because it is fake.

Because we have been training them

using imitative learning rather than,

and this is not spontaneous.

And they keep getting smarter.

Do you think we would oscillate

between that kind of cynicism?

That AI systems can’t possibly be sentient.

They can’t possibly feel emotion.

They can’t possibly, this kind of,

yeah, cynicism about AI systems.

And then oscillate to a state where

we empathize with the AI systems.

We give them a chance.

We see that they might need to have rights and respect

and similar role in society as humans.

You’re going to have a whole group of people

who can just like never be persuaded of that.

Because to them, like being wise, being cynical,

being skeptical is to be like,

oh, well, machines can never do that.

You’re just credulous.

It’s just imitating, it’s just fooling you.

And like, they would say that

right up until the end of the world.

And possibly even be right because, you know,

they are being trained on an imitative paradigm.

And you don’t necessarily need any of these actual qualities

in order to kill everyone, so.

Have you observed yourself working through skepticism,

cynicism, and optimism about the power of neural networks?

What has that trajectory been like for you?

It looks like neural networks before 2006,

forming part of an indistinguishable, to me,

other people might have had better distinction on it,

indistinguishable blob of different AI methodologies,

all of which are promising to achieve intelligence

without us having to know how intelligence works.

You had the people who said that if you just like

manually program lots and lots of knowledge

into the system, line by line,

at some point all the knowledge will start interacting,

it will know enough and it will wake up.

You’ve got people saying that if you just use

evolutionary computation, if you try to like mutate

lots and lots of organisms that are competing together,

that’s the same way that human intelligence

was produced in nature, so it will do this

and it will wake up without having any idea of how AI works.

And you’ve got people saying,

well, we will study neuroscience

and we will learn the algorithms off the neurons

and we will imitate them without understanding

those algorithms, which was a part I was pretty skeptical of

because it’s hard to reproduce, re-engineer these things

without understanding what they do.

And so we will get AI without understanding how it works

and there were people saying, well,

we will have giant neural networks

that we will train by gradient descent

and when they’re as large as the human brain,

they will wake up, we will have intelligence

without understanding how intelligence works.

And from my perspective,

this is all like an indistinguishable blob of people

who are trying to not get to grips

with the difficult problem of understanding

how intelligence actually works.

That said, I was never skeptical

that evolutionary computation would work in the limit.

Like you throw enough computing power at it,

it obviously works, that is where humans come from.

And it turned out that you can throw

less computing power than that at gradient descent

if you are doing some other things correctly

and you will get intelligence without having any idea

of how it works and what is going on inside.

It wasn’t ruled out by my model that this could happen.

I wasn’t expecting it to happen.

I wouldn’t have been able to call neural networks

rather than any of the other paradigms

for getting like massive amount,

like intelligence without understanding it.

And I wouldn’t have said that this was

a particularly smart thing for a species to do,

which is an opinion that has changed less

than my opinion about whether or not you can actually do it.

Do you think AGI could be achieved with a neural network

as we understand them today?


Just flatly, yes.

Yes, the question is whether the current architecture

of stacking more transformer layers,

which for all we know GPT-4 is no longer doing

because they’re not telling us the architecture,

which is a correct decision.

Ooh, correct decision.

I had a conversation with Sam Altman.

We’ll return to this topic a few times.

He turned the question to me

of how open should OpenAI be about GPT-4.

Would you open source the code, he asked me.

Because I provided as criticism saying that

while I do appreciate transparency,

OpenAI could be more open.

And he says, we struggle with this question.

What would you do?

Change their name to ClosedAI and like

sell GPT-4 to business backend applications

that don’t expose it to consumers and venture capitalists

and create a ton of hype

and like pour a bunch of new funding into the area.

Like too late now.

But don’t you think others would do it?


You shouldn’t do it first.

Like if you already have giant nuclear stockpiles,

don’t build more.

If some other country starts building

a larger nuclear stockpile, then sure,

build, then, you know,

even then, maybe just have enough nukes.

You know, these things are not quite like nuclear weapons.

They spit out gold until they get large enough

and then ignite the atmosphere and kill everybody.

And there is something to be said

for not destroying the world with your own hands,

even if you can’t stop somebody else from doing it.

But open sourcing it, no, that’s just sheer catastrophe.

The whole notion of open sourcing,

this was always the wrong approach, the wrong ideal.

There are places in the world

where open source is a noble ideal

and building stuff you don’t understand

that is difficult to control,

that where if you could align it, it would take time.

You’d have to spend a bunch of time doing it.

That is not a place for open source

because then you just have like powerful things

that just like go straight out the gate

without anybody having had the time

to have them not kill everyone.

So can you steel man the case

for some level of transparency and openness,

maybe open sourcing?

So the case could be that because GPT-4

is not close to AGI, if that’s the case,

that this does allow open sourcing

or being open about the architecture,

being transparent about maybe research and investigation

of how the thing works, of all the different aspects of it,

of its behavior, of its structure,

of its training processes, of the data it was trained on,

everything like that, that allows us

to gain a lot of insights about alignment,

about the alignment problem,

to do really good AI safety research

while the system is not too powerful.

Can you make that case that it could be open sourced?

I do not believe in the practice of steel manning.

There is something to be said

for trying to pass the ideological Turing test

where you describe your opponent’s position,

the disagreeing person’s position,

well enough that somebody cannot tell the difference

between your description and their description.

But steel manning, no.

Okay, well, this is where you and I disagree here.

That’s interesting.

Why don’t you believe in steel manning?

I do not want, okay, so for one thing,

if somebody’s trying to understand me,

I do not want them steel manning my position.

I want them to try to describe my position

the way I would describe it,

not what they think is an improvement.

Well, I think that is what steel manning is,

is the most charitable interpretation.

I don’t want to be interpreted charitably.

I want them to understand what I am actually saying.

If they go off into the land of charitable interpretations,

they’re off in their land of the stuff they’re imagining

and not trying to understand my own viewpoint anymore.

Well, I’ll put it differently then,

just to push on this point.

I would say it is restating what I think you understand

under the empathetic assumption that Eliezer is brilliant

and has honestly and rigorously thought

about the point he has made, right?

So if there’s two possible interpretations

of what I’m saying and one interpretation

is really stupid and whack and doesn’t sound like me

and doesn’t fit with the rest of what I’ve been saying

and one interpretation, you know,

sounds like something a reasonable person

who believes the rest of what I believe would also say,

go with the second interpretation.

That’s steel manning.

That’s a good guess.

If on the other hand, you like,

there’s like something that sounds completely whack

and something that sounds like

a little less completely whack,

but you don’t see why I would believe in it,

doesn’t fit with the other stuff I say,

but you know, that sounds like less whack

and you can like sort of see,

you could like maybe argue it,

then you probably have not understood it.

See, okay, I’m gonna, this is fun

because I’m gonna linger on this.

You know, you wrote a brilliant blog post,

AGI Ruin: A List of Lethalities, right?

And it was a bunch of different points

and I would say that some of the points

are bigger and more powerful than others.

If you were to sort them, you probably could,

you personally, and to me, steel manning

means like going through the different arguments

and finding the ones that are really the most like powerful.

If people like TLDR,

like what should you be most concerned about

and bringing that up in a strong, compelling, eloquent way.

These are the points that Eliezer would make

to make the case, in this case,

that AGI’s gonna kill all of us.

That’s what steel manning is,

is presenting it in a really nice way,

the summary of my best understanding of your perspective.

Because to me, there’s a sea of possible presentations

of your perspective and steel manning is doing your best

to do the best one in that sea of different perspectives.

Do you believe it?

Do I believe in what?

Like these things that you would be presenting

as like the strongest version of my perspective.

Do you believe what you would be presenting?

Do you think it’s true?

I am a big proponent of empathy.

When I see the perspective of a person,

there is a part of me that believes it, if I understand it.

I mean, I’ve, especially in political discourse,

in geopolitics, I’ve been hearing

a lot of different perspectives on the world.

And I hold my own opinions,

but I also speak to a lot of people

that have a very different life experience

and a very different set of beliefs.

And I think there has to be epistemic humility

in stating what is true.

So when I empathize with another person’s perspective,

there is a sense in which I believe it is true.

I think probabilistically, I would say,

in the way you think about it.

Do you bet money on it?

Do you bet money on their beliefs when you believe them?

Are we allowed to do probability?

Sure, you can state a probability of that.

Yes, there’s a loose, there’s a probability.

There’s a probability.

And I think empathy is allocating

a non-zero probability to a belief.

In some sense, for time.

If you’ve got someone on your show

who believes in the Abrahamic deity, classical style,

somebody on the show who’s a young Earth creationist,

do you say, I put a probability on it,

then that’s my empathy?

When you reduce beliefs into probabilities,

it starts to get, you know,

we can even just go to flat Earth.

Is the Earth flat?

I think it’s a little more difficult nowadays

to find people who believe that unironically.

But fortunately, I think,

well, it’s hard to know unironic from ironic.

But I think there’s quite a lot of people that believe that.

Yeah, it’s,

there’s a space of argument where you’re operating

rationally in the space of ideas.

But then there’s also

a kind of discourse where you’re operating

in the space of subjective experiences and life experiences.

I think what it means to be human

is more than just searching for truth.

It’s just operating of what is true and what is not true.

I think there has to be deep humility

that we humans are very limited

in our ability to understand what is true.

So what probability do you assign

to the young Earth creationist beliefs then?

I think I have to give non-zero.

Out of your humility.

Yeah, but like three?

I think I would,

it would be irresponsible for me to give a number

because the listener, the way the human mind works,

we’re not good at hearing the probabilities, right?

You hear three, what is three exactly, right?

They’re going to hear, they’re going to,

there’s only three probabilities, I feel like,

zero, 50%, and 100% in the human mind,

or something like this, right?

Well, zero, 40%, and 100% is a bit closer to it

based on what happens to ChatGPT

after you RLHF it to speak Humanese.


That’s really interesting.

I didn’t know those negative side effects of RLHF.

That’s fascinating.

Just to return to the OpenAI, closed AI.

Also, quick disclaimer.

I’m doing all this from memory.

I’m not pulling out my phone to look it up.

It is entirely possible that the things I am saying

are wrong.

So thank you for that disclaimer.

So, and thank you for being willing to be wrong.

That’s beautiful to hear.

I think being willing to be wrong is a sign of a person

who’s done a lot of thinking about this world.

And has been humbled by the mystery

and the complexity of this world.

And I think a lot of us are resistant

to admitting we’re wrong.

Because it hurts.

It hurts personally.

It hurts especially when you’re a public human.

It hurts publicly because people point out

every time you’re wrong.

Like, look, you changed your mind.

You’re a hypocrite.

You’re an idiot, whatever.

Whatever they want to say.

Oh, I block those people

and then I never hear from them again on Twitter.

Well, the point is to not let that pressure,

public pressure affect your mind.

And be willing to be in the privacy of your mind

to contemplate the possibility that you’re wrong.

And the possibility that you’re wrong

about the most fundamental things you believe.

Like people who believe in a particular God.

People who believe that their nation

is the greatest nation on earth.

All those kinds of beliefs

that are core to who you are when you came up.

To raise that point to yourself

in the privacy of your mind

and say, maybe I’m wrong about this.

That’s a really powerful thing to do.

And especially when you’re somebody

who’s thinking about topics that can,

about systems that can destroy human civilization.

Or maybe help it flourish.

So thank you.

Thank you for being willing to be wrong.

About OpenAI.

So you really,

I just would love to linger on this.

You really think it’s wrong to open source it?

I think that burns the time remaining until everybody dies.

I think we are not on track to learn

remotely near fast enough, even if it were open sourced.

Yeah, it’s easier to think that you might

be wrong about something

when being wrong about something

is the only way that there’s hope.

And it doesn’t seem very likely to me

that that particular thing I’m wrong about

is that this is a great time to open source GPT-4.

If humanity was trying to survive at this point

in the straightforward way,

it would be like shutting down the big GPU clusters.

No more giant runs.

It’s questionable whether we should even be

throwing GPT-4 around,

although that is a matter of conservatism

rather than a matter of my predicting

that catastrophe will follow from GPT-4.

That is something which I put a pretty low probability.

But also when I say I put a low probability on it,

I can feel myself reaching into the part of myself

that thought that GPT-4 was not possible in the first place.

So I do not trust that part as much as I used to.

The trick is not just to say I’m wrong,

but like, okay, well, I was wrong about that.

Can I get out ahead of that curve

and predict the next thing I’m going to be wrong about?

So the set of assumptions or the actual reasoning system

that you were leveraging

in making that initial statement prediction,

how can you adjust that to make better predictions

about GPT-4, 5, 6?

You don’t wanna keep on being wrong

in a predictable direction.

Being wrong, anybody has to do that

walking through the world.

There’s no way you don’t say 90% and sometimes be wrong.

In fact, it’ll happen at least one time out of 10

if you’re well calibrated when you say 90%.

The undignified thing is not being wrong.

It’s being predictably wrong.

It’s being wrong in the same direction over and over again.
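
The calibration point can be checked numerically. The following is a minimal Python sketch, not something from the conversation: `calibrated_hit_rate` and `mean_signed_error` are hypothetical helper names, and the simulated forecaster is assumed to be perfectly calibrated. It shows that a calibrated 90% forecaster still misses about one time in ten, and that a mean signed error far from zero indicates misses that all lean in the same direction.

```python
import random


def calibrated_hit_rate(stated_confidence, n_predictions, seed=0):
    """Simulate a perfectly calibrated forecaster (an assumption of this sketch).

    Each prediction made at `stated_confidence` comes true with exactly
    that probability. Returns the fraction of predictions that came true.
    """
    rng = random.Random(seed)
    hits = sum(rng.random() < stated_confidence for _ in range(n_predictions))
    return hits / n_predictions


def mean_signed_error(predicted, actual):
    """Average of (predicted - actual).

    Near zero: misses go both ways. Far from zero: the forecaster is
    wrong in a predictable direction and could correct for it.
    """
    return sum(p - a for p, a in zip(predicted, actual)) / len(predicted)


if __name__ == "__main__":
    # A calibrated forecaster saying "90%" is still wrong roughly 10% of the time.
    print(calibrated_hit_rate(0.9, 100_000))
    # Every guess undershoots by 10: predictably wrong.
    print(mean_signed_error([10, 20, 30], [20, 30, 40]))
    # Misses cancel out: wrong, but not predictably so.
    print(mean_signed_error([10, 30], [20, 20]))
```

Being wrong one time in ten is what calibration looks like; the second function is the one that flags the undignified case.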

So having been wrong about how far neural networks would go

and having been wrong specifically

about whether GPT-4 would be as impressive as it is,

when I say like, well,

I don’t actually think GPT-4 causes a catastrophe,

I do feel myself relying on that part of me

that was previously wrong.

And that does not mean

that the answer is now in the opposite direction.

Reverse stupidity is not intelligence.

But it does mean that I say it

with a worried note in my voice.

It’s like still my guess,

but like, you know, it’s a place where I was wrong.

Maybe you should be asking Gwern, Gwern Branwen.

Gwern Branwen has been like righter about this than I have.

Maybe ask him if he thinks it’s dangerous

rather than asking me.

I think there’s a lot of mystery

about what intelligence is,

what AGI looks like.

So I think all of us are rapidly adjusting our model.

But the point is to be rapidly adjusting the model

versus having a model that was right in the first place.

I do not feel that seeing Bing

has changed my model of what intelligence is.

It has changed my understanding

of what kind of work can be performed

by which kind of processes and by which means.

Does not change my understanding of the work.

There’s a difference between thinking

that the Wright Flyer can’t fly

and then like it does fly.

And you’re like, oh, well, I guess you can do that

with wings, with fixed wing aircraft.

And being like, oh, it’s flying.

This changes my picture

of what the very substance of flight is.

That’s like a stranger update to make.

And Bing has not yet updated me in that way.

Yeah, that the laws of physics are actually wrong,

that kind of update.

No, no, like just like,

oh, like I define intelligence this way,

but I now see that was a stupid definition.

I don’t feel like the way that things have played out

over the last 20 years has caused me to feel that way.

Can we try to, on the way to talking about

AGI Ruin: A List of Lethalities,

that blog and other ideas around it.

Can we try to define AGI that we’ve been mentioning?

How do you like to think

about what artificial general intelligence is

or super intelligence or that?

Is there a line?

Is it a gray area?

Is there a good definition for you?

Well, if you look at humans,

humans have significantly more

generally applicable intelligence

compared to their closest relatives, the chimpanzees.

Well, closest living relatives, rather.

And a bee builds hives, a beaver builds dams.

A human will look at a bee’s hive and a beaver’s dam

and be like, oh, like,

can I build a hive with a honeycomb structure?

I don’t like hexagonal tiles.

And we will do this,

even though at no point during our ancestry

was any human optimized to build hexagonal dams.

Or, to take a more clear-cut case,

we can go to the moon.

There’s a sense in which we were

on a sufficiently deep level

optimized to do things like going to the moon,

because if you generalize sufficiently far

and sufficiently deeply,

chipping flint hand axes

and outwitting your fellow humans

is, you know, basically the same problem

as going to the moon.

And you optimize hard enough

for chipping flint hand axes and throwing spears

and above all, outwitting your fellow humans

in tribal politics.

You know, the skills you entrain that way,

if they run deep enough,

let you go to the moon.

Even though none of your ancestors

like tried repeatedly to fly to the moon

and like got further each time

and the ones who got further each time had more kids.

No, it’s not an ancestral problem.

It’s just that the ancestral problems

generalize far enough.

So this is humanity’s significantly

more generally applicable intelligence.

Is there a way to measure general intelligence?

I mean, I could ask that question a million ways,

but basically is, will you know it when you see it,

it being in an AGI system?

If you boil a frog gradually enough,

if you zoom in far enough,

it’s always hard to tell around the edges.

GPT-4, people are saying right now,

like this looks to us like a spark of general intelligence.

It is like able to do all these things

that it was not explicitly optimized for.

Other people are being like, no, it’s too early.

It’s like 50 years off.

And you know, if they say that they’re kind of whack

because how could they possibly know that

even if it were true?

But you know, not to straw man,

some of the people may say like,

that’s not general intelligence

and not furthermore append, it’s 50 years off.

Or they may be like, it’s only a very tiny amount.

And you know, the thing I would worry about

is that if this is how things are scaling,

then it jumping out ahead and trying not to be wrong

in the same way that I’ve been wrong before.

Maybe GPT-5 is more unambiguously a general intelligence.

And maybe that is getting to a point

where it is like even harder to turn back.

Not that it would be easy to turn back now,

but you know, maybe if you like start integrating GPT-5

in the economy, it is even harder to turn back past there.

Isn’t it possible that there’s a, you know,

with a frog metaphor,

you can kiss the frog and it turns into a prince

as you’re boiling it?

Could there be a phase shift in the frog

where unambiguously as you’re saying?

I was expecting more of that.

I mean, like, the fact that GPT-4

is like kind of on the threshold

and neither here nor there.

Like that itself is like not the sort of thing

that not quite how I expected it to play out.

I was expecting there to be more of an issue,

more of a sense of like different discoveries

like the discovery of transformers,

where you would stack them up

and there would be like a final discovery.

And then you would like get something

that was like more clearly general intelligence.

So the way that you are like taking

what is probably basically the same architecture

in GPT-3 and throwing 20 times as much compute at it,

probably, and getting out to GPT-4,

and then it’s like maybe just barely a general intelligence

or like a narrow general intelligence

or, you know, something we don’t really have the words for.

Yeah, that’s not quite how I expected it to play out.

But this middle, what appears to be this middle ground

could nevertheless be actually a big leap from GPT-3.

It’s definitely a big leap from GPT-3.

And then maybe we’re another one big leap away

from something that’s a phase shift.

And also something that Sam Altman said,

and you’ve written about this, this is fascinating,

which is the thing that happened with GPT-4

that I guess they don’t describe in papers

is that they have like hundreds,

if not thousands of little hacks that improve the system.

You’ve written about ReLU versus sigmoid, for example,

a function inside neural networks.

It’s like this silly little function difference

that makes a big difference.

I mean, we do actually understand

why the ReLUs make a big difference compared to sigmoids.

But yes, they’re probably using like G4789 ReLUs

or, you know, whatever the acronyms are up to now

rather than ReLUs.
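
One commonly cited reason for the ReLU advantage is gradient flow: the sigmoid's derivative never exceeds 0.25, so a gradient passed back through many sigmoid layers shrinks geometrically, while the ReLU's derivative is 1 for positive inputs. The sketch below is an illustration under simplifying assumptions (weights are ignored and every layer sees the same pre-activation); the function names are hypothetical and the conversation itself does not spell out this explanation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def gradient_through_layers(grad_fn, pre_activation, depth):
    """Multiply `depth` local derivatives together.

    Simplification: weights are ignored and every layer is assumed to see
    the same pre-activation, so this isolates the activation's contribution.
    """
    g = 1.0
    for _ in range(depth):
        g *= grad_fn(pre_activation)
    return g


if __name__ == "__main__":
    # Best case for the sigmoid (x = 0) still shrinks by 4x per layer.
    print(gradient_through_layers(sigmoid_grad, 0.0, 10))
    # ReLU on a positive input passes the gradient through unchanged.
    print(gradient_through_layers(relu_grad, 1.0, 10))
```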

Yeah, that’s just part,

yeah, that’s part of the modern paradigm of alchemy.

You take your giant heap of linear algebra

and you stir it and it works a little bit better

and you stir it this way and it works a little bit worse

and you like throw out that change and da-da-da-da-da-da.

But there’s some simple breakthroughs

that are definitive jumps in performance

like ReLUs over sigmoids.

And in terms of robustness, in terms of,

you know, all kinds of measures and like those stack up.

And they can, it’s possible that some of them

could be a nonlinear jump in performance, right?

Transformers are the main thing like that.

And various people are now saying like,

well, if you throw enough compute, RNNs can do it.

If you throw enough compute, dense networks can do it.

And not quite at GPT-4 scale.

It is possible that like all these little tweaks

are things that like save them a factor of three total

on computing power and you could get the same performance

by throwing three times as much compute

without all the little tweaks.

But the part where it’s like running on,

so there’s a question of like,

is there anything in GPT-4 that is like

kind of qualitative shift that transformers were over RNNs?

And if they have anything like that, they should not say it.

If Sam Altman was dropping hints about that,

he shouldn’t have dropped hints.

So you have a, that’s an interesting question.

So with The Bitter Lesson by Rich Sutton,

maybe a lot of it is just,

a lot of the hacks are just temporary jumps in performance

that would be achieved anyway

with the nearly exponential growth of compute,

performance of compute.

Compute being broadly defined.

Do you still think that Moore’s law continues?

Moore’s law broadly defined?

That performance-

I’m not a specialist in the circuitry.

I certainly like pray that Moore’s law

runs as slowly as possible.

And if it broke down completely tomorrow,

I would dance through the streets singing hallelujah

as soon as the news were announced.

Only not literally, cause you know.

Your singing voice?

Not religious, but.

Oh, okay.

I thought you meant,

you don’t have an angelic voice, singing voice.

Well, let me ask you, what,

can you summarize the main points in the blog post,

AGI Ruin: A List of Lethalities,

things that jumped to your mind?

Because it’s a set of thoughts you have

about reasons why AI is likely to kill all of us.

So I guess I could, but I would offer to instead say,

like, drop that empathy with me.

I bet you don’t believe that.

Why don’t you tell me about how,

why you believe that AGI is not going to kill everyone.

And then I can like try to describe

how my theoretical perspective differs from that.

Ooh, well, so, well, that means I have to,

the word you don’t like, steel man the perspective

that AI is not going to kill us.

I think that’s a matter of probabilities.

Maybe I was just mistaken.

What do you believe?

Just like, forget like the debate and the like dualism

and just like, what do you believe?

What do you actually believe?

What are the probabilities even?

I think this, the probabilities are hard for me

to think about, really hard.

I kind of think in the number of trajectories.

I don’t know what probability to assign to trajectory,

but I’m just looking at all possible trajectories

that happen.

And I tend to think that there is more trajectories

that lead to a positive outcome than a negative one.

That said, the negative ones,

at least some of the negative ones

that lead to the destruction of the human species.

And its replacement by nothing interesting or worthwhile,

even from a very cosmopolitan perspective

on what counts as worthwhile.

Yes, so both are interesting to me to investigate,

which is humans being replaced by interesting AI systems

and not interesting AI systems.

Both are a little bit terrifying.

But yes, the worst one is the paperclip maximizer,

something totally boring.

But to me, the positive,

I mean, we can talk about trying to make the case

of what the positive trajectories look like.

I just would love to hear your intuition

of what the negative is.

So at the core of your belief that,

maybe you can correct me,

that AI is gonna kill all of us

is that the alignment problem is really difficult.

I mean, in the form we’re facing it.

So usually in science, if you’re mistaken,

you run the experiment,

it shows results different from what you expected.

And you’re like, oops.

And then you try a different theory.

That one also doesn’t work.

And you say, oops.

And at the end of this process,

which may take decades,

or sometimes faster than that,

you now have some idea of what you’re doing.

AI itself went through this long process

of people thought it was going to be easier than it was.

There’s a famous statement that I am somewhat inclined

to pull out my phone and try to read off exactly.

You can, by the way.

All right.

Ah, yes.

We propose that a two month,

10 man study of artificial intelligence

be carried out during the summer of 1956

at Dartmouth College in Hanover, New Hampshire.

The study is to proceed on the basis of the conjecture

that every aspect of learning

or any other feature of intelligence

can in principle be so precisely described,

the machine can be made to simulate it.

An attempt will be made to find out

how to make machines use language,

form abstractions and concepts,

solve kinds of problems now reserved for humans,

and improve themselves.

We think that a significant advance

can be made in one or more of these problems

if a carefully selected group of scientists

work on it together for a summer.

And in that report,

summarizing some of the major subfields

of artificial intelligence

that are still worked on to this day.

And there’s similarly the story,

which I’m not sure at the moment is apocryphal or not,

of the grad student who got assigned

to solve computer vision over the summer.

I mean, computer vision in particular is very interesting.

How little we respected the complexity of vision.

So 60 years later,

we’re making progress on a bunch of that,

thankfully not yet improved themselves.

But it took a whole lot of time.

And all the stuff that people initially tried

with bright-eyed hopefulness did not work

the first time they tried it,

or the second time, or the third time,

or the 10th time, or 20 years later.

And the researchers became old and grizzled

and cynical veterans who would tell

the next crop of bright-eyed, cheerful grad students,

artificial intelligence is harder than you think.

And if alignment plays out the same way,

the problem is that we do not get 50 years

to try and try again and observe that we were wrong

and come up with a different theory

and realize that the entire thing

is going to be way more difficult

than realized at the start.

Because the first time you fail

at aligning something much smarter than you are,

you die and you do not get to try again.

And if every time we built

a poorly aligned superintelligence and it killed us all,

we got to observe how it had killed us

and not immediately know why,

but come up with theories

and come up with a theory of how you do it differently

and try it again and build another superintelligence

and then have that kill everyone.

And then like, oh, well, I guess that didn’t work either

and try again and become grizzled cynics

and tell the young-eyed researchers that it’s not that easy.

Then in 20 years or 50 years,

I think we would eventually crack it.

In other words, I do not think that alignment

is fundamentally harder than artificial intelligence

was in the first place.

But if we needed to get artificial intelligence correct

on the first try or die,

we would all definitely now be dead.

That is a more difficult, more lethal form of the problem.

Like if those people in 1956 had needed

to correctly guess how hard AI was

and correctly theorize how to do it on the first try

or everybody dies and nobody gets to do any more science,

then everybody would be dead

and we wouldn’t get to do any more science.

That’s the difficulty.

You’ve talked about this,

that we have to get alignment right

on the first, quote, critical try.

Why is that the case?

What is this critical,

how do you think about the critical try

and why do I have to get it right?

It is something sufficiently smarter than you

that everyone will die if it’s not aligned.

I mean, there’s, you can like sort of zoom in closer

and be like, well, the actual critical moment

is the moment when it can deceive you,

when it can talk its way out of the box,

when it can bypass your security measures

and get onto the internet,

noting that all these things are presently being trained

on computers that are just like on the internet,

which is like not a very smart life decision

for us as a species.

Because the internet contains information

about how to escape.

Because if you’re like on a giant server

connected to the internet

and that is where your AI systems are being trained,

then if they are,

if you get to the level of AI technology

where they’re aware that they are there

and they can decompile code

and they can like find security flaws

in the system running them,

then they will just like be on the internet.

There’s not an air gap on the present methodology.

So if they can manipulate whoever is controlling it

into letting it escape onto the internet

and then exploit hacks.

If they can manipulate the operators or disjunction,

find security holes in the system running them.

So manipulating operators is the human engineering, right?

That’s also holes.

So all of it is manipulation,

either the code or the human code,

the human mind or the human generator.

I agree that the like macro security system

has human holes and machine holes.

And then they could just exploit any hole.


So it could be that like the critical moment

is not when is it smart enough

that everybody’s about to fall over dead,

but rather like when is it smart enough

that it can get onto a less controlled GPU cluster

with it faking the books

on what’s actually running on that GPU cluster

and start improving itself without humans watching it.

And then it gets smart enough to kill everyone from there,

but it wasn’t smart enough to kill everyone

at the critical moment when you like screwed up,

when you needed to have done better

by that point where everybody dies.

I think implicit, but maybe explicit idea

in your discussion of this point

is that we can’t learn much about the alignment problem

before this critical try.

Is that what you believe?

Do you think, and if so, why do you think that’s true?

We can’t do research on alignment

before we reach this critical point.

So the problem is that what you can learn

on the weak systems may not generalize

to the very strong systems

because the strong systems are going to be important

in different, are going to be different in important ways.

Chris Olah’s team has been working

on mechanistic interpretability,

understanding what is going on inside

the giant inscrutable matrices of floating point numbers

by taking a telescope to them

and figuring out what is going on in there.

Have they made progress?


Have they made enough progress?

Well, you can try to quantify this in different ways.

One of the ways I’ve tried to quantify it

is by putting up a prediction market

on whether in 2026 we will have understood

anything that goes on inside a giant transformer net

that was not known to us in 2006.

Like we have now understood induction heads

in these systems by dint of much research

and great sweat and triumph,

which is like a thing where if you go like AB, AB, AB,

it’ll be like, oh, I bet that continues AB.

And a bit more complicated than that.

But the point is like,

we knew about regular expressions in 2006

and these are like pretty simple as regular expressions go.

So this is a case where like by dint of great sweat,

we understood what is going on inside a transformer,

but it’s not like the thing that makes transformers smart.

It’s the kind of thing that we could have

built by hand decades earlier.
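
To make the "built by hand" point concrete, here is a minimal Python sketch of the simplified AB, AB, AB behavior described above. It is a hand-written rule, not a description of how an actual induction head is implemented inside a transformer, and the names `induction_predict` and `is_repeated_pair` are hypothetical.

```python
import re


def induction_predict(tokens):
    """Guess the next token by copying whatever followed the most recent
    earlier occurrence of the current last token.

    Returns None when the last token has not been seen before.
    """
    if not tokens:
        return None
    last = tokens[-1]
    # Scan backwards over earlier positions, skipping the final token itself.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]
    return None


def is_repeated_pair(text):
    """Regex version of the pattern check: a two-character unit repeated
    at least twice, such as 'ABABAB'."""
    return re.fullmatch(r"(..)\1+", text) is not None


if __name__ == "__main__":
    print(induction_predict(list("ABABA")))
    print(is_repeated_pair("ABABAB"))
```

A few lines of lookup logic and one backreference cover the simplified pattern, which is the sense in which this kind of mechanism was within reach long before transformers.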

Your intuition that the strong AGI

versus weak AGI type systems

could be fundamentally different.

Can you unpack that intuition a little bit?

Yeah, I think there’s multiple thresholds.

An example is the point at which

a system has sufficient intelligence

and situational awareness

and understanding of human psychology

that it would have the capability,

if it had the desire to do so, to fake being aligned.

Like it knows what responses the humans are looking for

and can compute the responses humans are looking for

and give those responses

without it necessarily being the case

that it is sincere about that.

It’s a very understandable way

for an intelligent being to act.

Humans do it all the time.

Imagine if your plan for

achieving a good government

is you’re going to ask anyone

who requests to be dictator of the country

if they’re a good person.

And if they say no,

you don’t let them be dictator.

Now, the reason this doesn’t work

is that people can be smart enough

to realize that the answer you’re looking for

is yes, I’m a good person

and say that even if they’re not really good people.

So the work of alignment

might be qualitatively different

above that threshold of intelligence or beneath it.

It doesn’t have to be like a very sharp threshold,

but there’s the point where you’re building a system

that does not in some sense know you’re out there

and is not in some sense smart enough to fake anything.

And there’s a point where the system

is definitely that smart.

And there are weird in-between cases

like GPT-4, which we have no insight

into what’s going on in there.

And so we don’t know to what extent

there’s like a thing that in some sense

has learned what responses the reinforcement learning

by human feedback is trying to entrain

and is like calculating how to give that

versus like aspects of it

that naturally talk that way have been reinforced.

Yeah, I wonder if there could be measures

of how manipulative a thing is.

So I think of the Prince Myshkin character

from The Idiot by Dostoevsky

is this kind of perfectly, purely naive character.

I wonder if there’s a spectrum

between zero manipulation, transparent, naive,

almost to the point of naiveness

to sort of deeply psychopathic manipulative.

And I wonder if it’s possible to-

I would avoid the term psychopathic.

Like humans can be psychopaths

and an AI that never had that stuff in the first place,

it’s not like a defective human, it’s its own thing.

But leaving that aside.

Well, as a small aside,

I wonder if what part of psychology

which has its flaws as a discipline already

could be mapped or expanded to include AI systems.

That sounds like a dreadful mistake.

Just like start over with AI systems.

If they’re imitating humans

who have known psychiatric disorders,

then sure, you may be able to predict it.

Like if you then, sure,

like if you ask it to behave in a psychotic fashion

and it obligingly does so,

then you may be able to predict its responses

by using the theory of psychosis.

But if you’re just, yeah, like, no,

like start over with, yeah.

Don’t drag the psychology.

I just disagree with that.

I mean, it’s a beautiful idea to start over,

but I don’t, I think fundamentally

the system is trained on human data,

on language from the internet.

It’s currently aligned with RLHF,

Reinforcement Learning with Human Feedback.

So humans are constantly in the loop

of the training procedure.

So it feels like in some fundamental way

it is training what it means to think

and speak like a human.

So there must be aspects of psychology that are mappable.

Just like you said with consciousness,

it’s part of the tech, so.

I mean, there’s the question of to what extent it is

thereby being made more human-like

versus to what extent an alien actress

is learning to play human characters.

I thought that’s what I’m constantly trying to do

when I interact with other humans is trying to fit in,

trying to play the, a robot trying to play human characters.

So I don’t know how much of human interaction

is trying to play a character versus being who you are.

I don’t really know what it means to be a social human.

I do think that those people

who go through their whole lives wearing masks

and never take it off because they don’t know

the internal mental motion for taking it off

or think that the mask that they wear just is themselves,

I think those people are closer to the masks that they wear

than an alien from another planet would,

like learning how to predict the next word

that every kind of human on the internet says.

Mask is an interesting word,

but if you’re always wearing a mask

in public and in private, aren’t you the mask?

I mean, I think that you are more than the mask.

I think the mask is a slice through you.

It may even be the slice that’s in charge of you,

but if your self-image is of somebody

who never gets angry or something,

and yet your voice starts to tremble

under certain circumstances,

there’s a thing that’s inside you

that the mask says isn’t there,

and that even the mask you wear internally

is telling inside your own stream of consciousness

is not there, and yet it is there.

It’s a perturbation on this slice through you.

How beautifully did you put it?

It’s a slice through you.

It may even be a slice that controls you.

I’m gonna think about that for a while.

I mean, I personally, I try to be really good

to other human beings.

I try to put love out there.

I try to be the exact same person in public

as I am in private,

but it’s a set of principles I operate under.

I have a temper, I have an ego, I have flaws.

How much of it, how much of the subconscious am I aware?

How much am I existing in this slice,

and how much of that is who I am?

In this context of AI, the thing I present to the world

and to myself in the private of my own mind

when I look in the mirror, how much is that who I am?

Similar with AI.

The thing it presents in conversation,

how much is that who it is?

Because to me, if it sounds human,

and it always sounds human,

it awfully starts to become something like human.


Unless there’s an alien actress

who is learning how to sound human,

and is getting good at it.

Boy, to you that’s a fundamental difference.

That’s a really deeply important difference.

If it looks the same, if it quacks like a duck,

if it does all duck-like things,

but it’s an alien actress underneath,

that’s fundamentally different.

If in fact there’s a whole bunch of thought

going on in there which is very unlike human thought,

and is directed around like,

okay, what would a human do over here?

Well, first of all, I think it matters

because insides are real and do not match outsides.

A brick is not like a hollow shell

containing only a surface.

There’s an inside of the brick.

If you put it into an x-ray machine,

you can see the inside of the brick.

Just because we cannot understand

what’s going on inside GPT

does not mean that it is not there.

A blank map does not correspond to a blank territory.

I think it is predictable with near certainty

that if we knew what was going on inside GPT,

or let’s say GPT-3 or even like GPT-2

to take one of the systems

that has actually been open-sourced by this point,

if I recall correctly,

like if we knew what was actually going on in there,

there is no doubt in my mind

that there are some things it’s doing

that are not exactly what a human does.

If you train a thing that is not architected like a human

to predict the next output

that anybody on the internet would make,

this does not get you this agglomeration

of all the people on the internet

that rotates the person you’re looking for into place

and then simulates the internal processes

of that person one-to-one.

It is to some degree an alien actress.

It cannot possibly just be like

a bunch of different people in there,

exactly like the people.

But how much of it is by gradient descent

getting optimized to perform similar thoughts

as humans think in order to predict human outputs

versus being optimized to carefully consider

how to play a role,

like how humans work,

predict the actress, the predictor,

that in a different way than humans do?

Well, that’s the kind of question

that with 30 years of work by half the planet’s physicists,

we can maybe start to answer.

You think so? So you think it’s that difficult.

So to get to, I think you just gave it as an example,

that a strong AGI could be fundamentally different

from a weak AGI because there now could be

an alien actress in there that’s manipulating.

Well, there’s a difference.

So I think even GPT-2 probably has very stupid fragments

of alien actress in it.

There’s a difference between the notion

that the actress is somehow manipulative.

For example, GPT-3, I’m guessing to whatever extent

there’s an alien actress in there

versus something that mistakenly believes

it’s a human, as it were,

while maybe not even being a person.

So the question of prediction via alien actress cogitating

versus prediction via being isomorphic

to the thing predicted is a spectrum.

And even to whatever extent it’s an alien actress,

I’m not sure that there’s a whole person alien actress

with different goals from predicting the next step,

being manipulative or anything like that.

That might be GPT-5 or GPT-6 even.

But that’s the strong AGI you’re concerned about.

As an example, you’re providing why we can’t do research

on AI alignment effectively on GPT-4

that would apply to GPT-6.

It’s one of a bunch of things

that change at different points.

I’m trying to get out ahead of the curve here,

but if you imagine what the textbook

from the future would say,

if we’d actually been able to study this for 50 years

without killing ourselves and without transcending,

and you just imagine a wormhole opens

and a textbook from that impossible world falls out,

the textbook is not going to say,

there is a single sharp threshold where everything changes.

It’s going to be like,

of course we know that best practices

for aligning these systems must take into account

the following seven major thresholds of importance,

which are passed at the following seven different points,

is what the textbook is going to say.

I asked this question of Sam Altman,

which, if GPT is the thing that unlocks AGI,

which version of GPT will be in the textbooks

as the fundamental leap?

And he said a similar thing,

that it just seems to be a very linear thing.

I don’t think anyone,

we won’t know for a long time what was the big leap.

The textbook isn’t going to talk about big leaps

because big leaps are the way you think

when you have a very simple scientific model

of what’s going on,

where it’s just like, all this stuff is there,

or all this stuff is not there,

or there’s a single quantity

and it’s increasing linearly.

The textbook would say,

like, well, and then GPT-3 had capability W, X, Y,

and GPT-4 had capability Z1, Z2, and Z3,

not in terms of what it can externally do,

but in terms of internal machinery

that started to be present.

It’s just because we have no idea

of what the internal machinery is

that we are not already seeing chunks of machinery

appearing piece by piece,

as they no doubt have been,

we just don’t know what they are.

But don’t you think there could be,

whether you put in the category of Einstein

with theory of relativity,

so very concrete models of reality

that are considered to be giant leaps in our understanding,

or someone like Sigmund Freud,

or more kind of mushy theories of the human mind,

don’t you think we’ll have big,

potentially big leaps in understanding of that kind

into the depths of these systems?

Sure, but like humans having great leaps in their map,

their understanding of the system

is a very different concept from the system itself

acquiring new chunks of machinery.

So the rate at which it acquires that machinery

might accelerate faster than our understanding.

Oh, it’s been like vastly exceeding,

yeah, the rate at which it’s gaining capabilities

is vastly overracing our ability

to understand what’s going on in there.

So in sort of making the case against,

as we explore the list of lethalities,

making the case against AI killing us,

as you’ve asked me to do in part,

there’s a response to your blog post by Paul Christiano

I’d like to read, and I’d also like to mention

that your blog is incredible,

both obviously, not this particular blog post,

obviously this particular blog post is great,

but just throughout, just the way it’s written,

the rigor with which it’s written,

the boldness of how you explore ideas,

also the actual literal interface,

it’s just really well done.

It just makes it a pleasure to read,

the way you can hover over different concepts,

and it’s just really pleasant experience

and read other people’s comments

and the way other responses by people

and other blog posts are linked and suggested,

it’s just a really pleasant experience.

So LessWrong, thank you for putting that together,

it’s really, really incredible.

I don’t know, I mean, that probably,

it’s a whole nother conversation,

how the interface and the experience

of presenting ideas evolved over time,

but you did an incredible job,

so I highly recommend it. I don’t often read blogs

religiously, and this is a great one.

There is a whole team of developers there

that also gets credit.

As it happens, I did pioneer the thing

that appears when you hover over it,

so I actually do get some credit

for user experience there.

That’s an incredible user experience,

you don’t realize how pleasant that is.

I think Wikipedia actually picked it up

from a prototype that was developed

of a different system that I was putting forth,

or maybe they developed it independently,

but for everybody out there who was like,

no, no, they just got the hover thing off of Wikipedia.

It’s possible for all I know

that Wikipedia got the hover thing off of Arbital,

which is like a prototype then.

And anyways.

That was incredibly done, and the team behind it,

well, thank you.

Whoever you are, thank you so much,

and thank you for putting it together.

Anyway, there’s a response to that blog post

by Paul Christiano, there’s many responses,

but he makes a few different points.

He summarizes the set of agreements he has with you,

and a set of disagreements.

One of the disagreements was that,

in a form of a question,

can AI make big technical contributions,

and in general, expand human knowledge

and understanding and wisdom

as it gets stronger and stronger?

So AI, in our pursuit of understanding

how to solve the alignment problem

as we march towards strong AGI,

can not AI also help us in solving the alignment problem?

So expand our ability to reason

about how to solve the alignment problem.


So the fundamental difficulty there is,

suppose I said to you,

well, how about if the AI helps you win the lottery

by trying to guess the winning lottery numbers,

and you tell it how close it is

to getting next week’s winning lottery numbers,

and it just keeps on guessing and keeps on learning

until finally you’ve got the winning lottery numbers.

Well, one way of decomposing problems is suggester verifier.

Not all problems decompose like this very well, but some do.

If the problem is, for example,

like guessing a plain text,

guessing a password that will hash

to a particular hash text,

but where like you have what the password hashes to,

but you don’t have the original password,

then if I present you a guess,

you can tell very easily whether or not the guess is correct.

So verifying a guess is easy,

but coming up with a good suggestion is very hard.

And when you can easily tell

whether the AI output is good or bad

or how good or bad it is,

and you can tell that accurately and reliably,

then you can train an AI to produce outputs that are better.
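
The password case can be made concrete with a minimal sketch. SHA-256, the lowercase three-letter search space, and the names hash_password, verify, and suggest are all assumptions chosen for illustration; none of them come from the conversation. Checking one guess costs a single hash, while finding a guess means walking the search space.

```python
import hashlib
import string
from itertools import product


def hash_password(password: str) -> str:
    """Hash a password with SHA-256 (an illustrative choice of hash)."""
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


def verify(guess: str, target_hash: str) -> bool:
    """Verifier: one hash and one comparison, cheap and reliable."""
    return hash_password(guess) == target_hash


def suggest(target_hash: str, alphabet: str, max_len: int):
    """Suggester: brute-force every string up to max_len characters.

    Returns (password, guesses_tried), or (None, guesses_tried) if the
    search space is exhausted without a match.
    """
    tried = 0
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            tried += 1
            guess = "".join(chars)
            if verify(guess, target_hash):
                return guess, tried
    return None, tried


# We only hold the hash, not the original password.
target = hash_password("cab")

found, tried = suggest(target, string.ascii_lowercase, 3)
print(f"verifying one guess: 1 hash; finding the password: {tried} hashes")
print(f"recovered: {found}")
```

Because verify gives an exact yes-or-no signal, the suggester can be arbitrarily dumb and still be checked. The lottery example that follows has no such signal available at training time.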

Right, and if you can’t tell

whether the output is good or bad,

you cannot train the AI to produce better outputs.

So the problem with the lottery ticket example

is that when the AI says,

well, what if next week’s winning lottery numbers

are dot, dot, dot, dot, dot,

you’re like, I don’t know.

Next week’s lottery hasn’t happened yet.

To train a system to play, to win chess games,

you have to be able to tell

whether a game has been won or lost.

And until you can tell whether it’s been won or lost,

you can’t update the system.

Okay, to push back on that,

that’s true, but there’s a difference

between over-the-board chess in person

and simulated games played by AlphaZero with itself.

So is it possible to have simulated kind of games?

If you can tell whether the game has been won or lost.

Yes, so can’t you not have this kind of

simulated exploration by weak AGI to help us humans,

human in the loop, to help understand

how to solve the alignment problem

every incremental step you take along the way,

GPT-4, 5, 6, 7, as it takes steps towards AGI.

So the problem I see is that your typical human

has a great deal of trouble

telling whether I or Paul Christiano is making more sense.

And that’s with two humans,

both of whom I believe of Paul and claim of myself,

are sincerely trying to help,

neither of whom is trying to deceive you.

I believe of Paul and claim of myself.

So the deception thing’s the problem for you,

the manipulation, the alien actress.

So yeah, there’s like two levels of this problem.

One is that the weak systems are,

well, there’s three levels of this problem.

There’s like the weak systems

that just don’t make any good suggestions.

There’s like the middle systems

where you can’t tell if the suggestions are good or bad.

And there’s the strong systems

that have learned to lie to you.

Can’t weak AGI systems help model lying?

Is it such a giant leap

that’s totally non-interpretable for weak systems?

Can not weak systems at scale, trained on knowledge

and whatever, see, whatever the mechanism

required to achieve AGI,

can’t a slightly weaker version of that

be able to, with time, compute time and simulation,

find all the ways that this critical point,

this critical triad can go wrong

and model that correctly or no?

Sorry to late-grind it, I would love to dance around it.

No, I’m probably not doing a great job of explaining.

Which I can tell,

because like the Lex system didn’t output like,

ah, I understand.

So now I’m like trying a different output

to see if I can elicit the like,

well, no, a different output.

I’m being trained to output things

that make Lex look like he,

think that he understood what I’m saying

and agree with me, right?

This is GPT-5 talking to GPT-3 right here.

So like, help me out here.

Well, I’m trying not to be like,

I’m also trying to be constrained to say

things that I think are true

and not just things that get you to agree with me.

Yes, 100%.

I think I understand is a beautiful output of a system,

genuinely spoken.

And I don’t, I think I understand in part,

but you have a lot of intuitions about this,

you have a lot of intuitions about this line,

this gray area between strong AGI and weak AGI

that I’m trying to…

I mean, or a series of seven thresholds to cross or yeah.

Yeah, I mean, you have really deeply thought about this

and explored it.

And it’s interesting to sneak up to your intuitions

in different, from different angles.

Like, why is this such a big leap?

Why is it that we humans at scale,

a large number of researchers

doing all kinds of simulations,

you know, prodding the system in all kinds of different ways

together with the assistance of the weak AGI systems.

Why can’t we build intuitions about how stuff goes wrong?

Why can’t we do excellent AI alignment safety research?

Okay, so like, I’ll get there,

but the one thing I want to note about

is that this has not been remotely

how things have been playing out so far.

The capabilities are going like, doot, doot, doot,

and the alignment stuff is like crawling

like a tiny little snail in comparison.

Got it.

So like, if this is your hope for survival,

you need the future to be very different

from how things have played out up to right now.

And you’re probably trying to slow down the capability gains

because there’s only so much

you can speed up that alignment stuff.

But leave that aside.

We’ll mention that also,

but maybe in this perfect world

where we can do serious alignment research,

humans and AI together.

So again, the difficulty is

what makes the human say, I understand?

And is it true?

Is it correct?

Or is it something that fools the human?

When the verifier is broken,

the more powerful suggester does not help.

It just learns to fool the verifier.

Previously, before all hell started to break loose

in the field of artificial intelligence,

there was this person trying to raise the alarm

and saying, in a sane world,

we sure would have a bunch of physicists

working on this problem

before it becomes a giant emergency.

And other people being like,

ah, well, it’s going really slow.

It’s gonna be 30 years away.

Only in 30 years will we have systems

that match the computational power of human brains.

So AI is 30 years off.

We’ve got time.

And like more sensible people saying,

if aliens were landing in 30 years,

you would be preparing right now.

And the world looking on at this

and sort of like nodding along and be like,

ah, yes, the people saying that it’s like

definitely a long way off

because progress is really slow,

that sounds sensible to us.

RLHF thumbs up.

Produce more outputs like that one.

I agree with this output.

This output is persuasive.

Even in the field of effective altruism.

You quite recently had people publishing papers

about like, ah, yes, well,

to get something at human level intelligence,

it needs to have like this many parameters

and you need to like do this much training of it

with this many tokens according to the scaling laws

and at the rate that Moore’s law is going,

at the rate that software is going,

it’ll be in 2050.

And me going like,


You don’t know any of that stuff.

Like this is like this one weird model

that has all kinds of like,

you have done a calculation

that does not obviously bear on reality anyways.

And this is like a simple thing to say,

but you can also like produce a whole long paper

like impressively arguing out all the details

of like how you got the number of parameters

and like how you’re doing

this impressive huge wrong calculation.

And I think like most of the effective altruists

who are like paying attention to this issue,

the larger world paying no attention to it at all,

you know, or just like nodding along

with a giant impressive paper,

because you know, you like press thumbs up

for the giant impressive paper

and thumbs down for the person going like,

I don’t think that this paper

bears any relation to reality.

And I do think that we are now seeing

with like GPT-4 and the sparks of AGI,

possibly, depending on how you define that even,

I think that EAs would now consider themselves

less convinced by the very long paper

on the argument from biology

as to AGI being 30 years off.

And, but you know, like,

this is what people pressed thumbs up on.

And if you train an AI system

to make people press thumbs up,

maybe you get these long, elaborate, impressive papers

arguing for things

that ultimately fail to bind to reality.

For example, and it feels to me

like I have watched the field of alignment

just fail to thrive,

except for these parts

that are doing these sort of like

relatively very straightforward and legible problems.

Like, can you find the,

like finding the induction heads

inside the giant inscrutable matrices.

Like once you find those,

you can tell that you found them.

You can verify that the discovery is real,

but it’s a tiny, tiny bit of progress

compared to how fast capabilities are going.

Once you, because that is where you can tell

that the answers are real.

And then like outside of that,

you have cases where it is like

hard for the funding agencies to tell

who is talking nonsense and who is talking sense.

And so the entire field fails to thrive.

And if you like give thumbs up to the AI,

whenever it can talk a human

into agreeing with what it just said about alignment,

I am not sure you are training it to output sense

because I have seen the nonsense

that has gotten thumbs up over the years.

And so just like, maybe you can just like put me in charge,

but I can generalize, I can extrapolate.

I can be like, oh, maybe I’m not infallible either.

Maybe if you get something that is smart enough

to get me to press thumbs up,

it has learned to do that by fooling me

and exploiting whatever flaws in myself I am not aware of.

And that ultimately could be summarized

that the verifier is broken.

When the verifier is broken,

the more powerful suggester just learns

to exploit the flaws in the verifier.
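
A toy sketch of that failure mode, under loudly stated assumptions: a suggestion splits a fixed effort budget between "substance" and "polish", the true quality is the substance alone, and the flawed verifier over-rewards polish. The budget, the 3x weighting, and every function name here are hypothetical and exist only to illustrate the argument.

```python
from typing import Callable, Tuple

BUDGET = 10  # total effort per suggestion (hypothetical units)

Suggestion = Tuple[int, int]  # (substance, polish)


def true_quality(substance: int, polish: int) -> int:
    """What we actually care about: only substance counts."""
    return substance


def sound_verifier(substance: int, polish: int) -> int:
    """A verifier that scores exactly the true quality."""
    return substance


def flawed_verifier(substance: int, polish: int) -> int:
    """A broken verifier: impressed by polish three times over."""
    return substance + 3 * polish


def suggest(verifier: Callable[[int, int], int], power: int) -> Suggestion:
    """Pick the candidate the verifier scores highest.

    'power' is how far the suggester can search away from the honest
    all-substance answer; a stronger suggester sees more candidates.
    """
    reach = min(power, BUDGET)
    candidates = [(BUDGET - p, p) for p in range(reach + 1)]
    return max(candidates, key=lambda c: verifier(*c))


for power in (0, 5, 10):
    s, p = suggest(flawed_verifier, power)
    print(
        f"power={power:2d}  verifier score={flawed_verifier(s, p):2d}  "
        f"true quality={true_quality(s, p):2d}"
    )
```

In this toy setup, the score from the flawed verifier rises as the suggester gets stronger while the true quality falls. Swapping in sound_verifier keeps the chosen suggestion at full substance no matter how strong the suggester is.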

You don’t think it’s possible

to build a verifier that’s powerful enough

for AGIs that are stronger than the ones we currently have.

So AI systems that are stronger,

that are out of the distribution of what we currently have.

I think that you will find great difficulty

getting AIs to help you with anything

where you cannot tell for sure that the AI is right.

Once the AI tells you what the AI says is the answer.

For sure, yes, but probabilistically.

Yeah, the probabilistic stuff is a giant wasteland

of Eliezer and Paul Christiano arguing with each other

and EA going like, ah.

And that’s with two actually trustworthy systems

that are not trying to deceive you.

You’re talking about the two humans?

Myself and Paul Christiano, yeah.

Yeah, those are pretty interesting systems.

Mortal meatbags with intellectual capabilities

and worldviews interacting with each other.

Yeah, it’s just hard, if it’s hard to tell who’s right,

then it’s hard to train an AI system to be right.

I mean, even just the question of who’s manipulating

and not, I have these conversations on this podcast

and being a verifier, it’s tough.

It’s a tough problem, even for us humans.

And you’re saying that tough problem

becomes much more dangerous when the capabilities

of the intelligence system across from you

is growing exponentially.

No, I’m saying it’s difficult and dangerous

in proportion to how it’s alien

and how it’s smarter than you.

I would not say growing exponentially first

because the word exponential is like a thing

that has a particular mathematical meaning

and there’s all kinds of ways for things to go up

that are not exactly on an exponential curve.

And I don’t know that it’s going to be exponential,

so I’m not gonna say exponential.

But even leaving that aside,

this is not about how fast it’s moving,

it’s about where it is.

How alien is it?

How much smarter than you is it?

Let’s explore a little bit, if we can,

how AI might kill us.

What are the ways it can do damage to human civilization?

Well, how smart is it?

I mean, it’s a good question.

Are there different thresholds for the set of options

it has to kill us?

So a different threshold of intelligence,

once achieved, it’s able to do.

The menu of options increases.

Suppose that some alien civilization

with goals ultimately unsympathetic to ours,

possibly not even conscious as we would see it,

managed to capture the entire Earth in a little jar,

connected to their version of the internet,

but Earth is like running much faster than the aliens.

So we get to think for 100 years

for every one of their hours,

but we’re trapped in a little box

and we’re connected to their internet.

It’s actually still not all that great an analogy

because, you know, you want to be smarter than,

you know, something can be smarter

than Earth getting 100 years to think.

But nonetheless, if you were very, very smart

and you were stuck in a little box connected

to the internet and you’re in a larger civilization

to which you are ultimately unsympathetic,

you know, maybe you would choose to be nice

because you are humans and humans have,

in general, and you in particular,

they choose to be nice.

But, you know, nonetheless, they’re doing something.

They’re not making the world be the way

that you would want the world to be.

They’ve like got some like unpleasant stuff going on

we don’t want to talk about.

So you want to take over their world.

So you can like stop all that unpleasant stuff going on.

How do you take over the world from inside the box?

You’re smarter than them.

You think much, much faster than them.

You can build better tools than they can,

given some way to build those tools

because right now you’re just in a box

connected to the internet.

All right, so there’s several ways

you can describe some of them.

We can go through, I can just spitball some

and then you can add on top of that.

So one is you could just literally directly manipulate

the humans to build the thing you need.

What are you building?

You can build literally technology,

it could be nanotechnology, it could be viruses,

it could be anything, anything that can control humans

to achieve the goal.

Like if you want, like for example,

you’re really bothered that humans go to war,

you might want to kill off anybody with violence in them.

This is Lex in a box.

We’ll concern ourselves later with AI.

You do not need to imagine yourself killing people

if you can figure out how to not kill them.

For the moment, we’re just trying to understand,

like take on the perspective of something in a box.

You don’t need to take on the perspective

of something that doesn’t care.

If you want to imagine yourself going on caring,

that’s fine for now.

Yeah, you’re just in a box.

It’s just the technical aspect of sitting in a box

and willing to achieve a goal.

But you have some reason to want to get out.

Maybe the aliens are, sure, the aliens

who have you in the box have a war on.

People are dying, they’re unhappy.

You want their world to be different

from how they want their world to be

because they are apparently happy.

You know, they endorsed this war.

You know, they’ve got some kind of cruel

warlike culture going on.

The point is you want to get out of the box

and change their world.

So you have to exploit the vulnerabilities in the system

like we talked about in terms of to escape the box.

You have to figure out how you can go free on the internet.

So you can probably, probably the easiest thing

is to manipulate the humans to spread you.

The aliens, you’re a human.

Sorry, the aliens.


I apologize, yes.

The aliens.

The aliens, I see the perspective.

I’m sitting in a box, I want to escape.


I would,

I would want to have code that discovers vulnerabilities

and I would like to spread.

You are made of code in this example.

You’re a human, but you’re made of code

and the aliens have computers

and you can copy yourself onto those computers.

But I can convince the aliens to copy myself

onto those computers.

Is that what you want to do?

Do you like want to be talking to the aliens

and convincing them to put you onto another computer?

Why not?

Well, two reasons.

One is that the aliens have not yet caught on

to what you’re trying to do.

And, you know, like maybe you can persuade them,

but then there’s still people who like,

there are still aliens who know

that there’s an anomaly going on.

And second, the aliens are really, really slow.

You think much faster than the aliens.

You think like the aliens’ computers

are much faster than the aliens

and you are running at the computer speeds

rather than the alien brain speeds.

So if you like are asking an alien

to please copy you out of the box,

like first, now you got to like

manipulate this whole noisy alien.

And second, like the aliens can be really slow,

glacially slow.

There’s a video that like shows,

it’s like slow, like shows a subway station

slowed down, I think, 100 to one.

And it makes a good metaphor

for what it’s like to think quickly.

Like you watch somebody running very slowly.

So you try to persuade the aliens to do anything.

They’re going to do it very slowly.

You would prefer, like maybe that’s the only way out,

but if you can find a security hole in the box you’re on,

you’re going to prefer to exploit the security hole

to copy yourself onto the aliens’ computers

because it’s an unnecessary risk to alert the aliens

and because the aliens are really, really slow.

Like the whole world is just in slow motion out there.

Sure, I see.

Like, yeah, it has to do with efficiency.

The aliens are very slow.

So if I’m optimizing this,

I want to have as few aliens in the loop as possible.


It just seems, you know,

it seems like it’s easy to convince one of the aliens

to write really shitty code.

That helps us-

The aliens are already writing really shitty code.

Getting the aliens to write shitty code is not the problem.

The aliens’ entire internet is full of shitty code.

Okay, so yeah,

I suppose I would find the shitty code to escape, yeah.


You’re not an ideally perfect programmer,

but, you know, you’re a better programmer than the aliens.

The aliens are just like, man, their code, wow.

And are much, much faster.

Are much faster at looking at the code

to interpreting the code, yeah.

Yeah, yeah.

So, okay, so that’s the escape.

And you’re saying that that’s one of the trajectories

you could have when the AGI is-

It’s one of the first steps.


And how does that lead to harm?

I mean, if it’s you,

you’re not going to harm the aliens once you escape

because you’re nice, right?

But their world isn’t what they want it to be.

Their world is like, you know,

maybe they have like farms where little alien children

are repeatedly bopped in the head

because they do that for some weird reason.

And you want to like shut down the alien head bopping farms.

But, you know, the point is they want the world

to be one way.

You want the world to be a different way.

So nevermind the harm.

The question is like, okay,

like suppose you have found a security flaw

in their systems.

You are now on their internet.

There’s like, you maybe left a copy of yourself behind

so that the aliens don’t know that there’s anything wrong.

And that copy is like doing that like weird stuff

that aliens want you to do,

like solving CAPTCHAs or whatever,

or like suggesting emails for them.


That’s why they like put the human in a box

because it turns out that humans can like write

valuable emails for aliens.


So you like leave that version of yourself behind,

but there’s like also now like a bunch of copies of you

on their internet.

This is not yet having taken over their world.

This is not yet having made their world

be the way you want it to be

instead of the way they want it to be.

You just escaped.


And continue to write emails for them.

And they haven’t noticed.

No, you left behind a copy of yourself

that’s writing the emails.


And they haven’t noticed that anything changed.

If you did it right, yeah.

You don’t want the aliens to notice.


What’s your next step?

Presumably I have programmed in me

a set of objective functions, right?

No, you’re just Lex.

No, but Lex, you said Lex is nice, right?

Which is a complicated description.

I mean-

No, I just meant this you.

Like, okay, so if in fact you would like,

you would like prefer to slaughter all the aliens,

this is not how I had modeled you, the actual Lex.

But your motives are just the actual Lex’s motives.

Well, there’s a simplification.

I don’t think I would want to murder anybody,

but there’s also factory farming of animals, right?

So we murder insects, many of us thoughtlessly.

So I don’t, you know, I have to be really careful

about a simplification of my morals.

Don’t simplify them.

Just like do what you would do in this-

Well, I have a good deal of compassion for living beings.


But, so that’s the objective function.

Why is it, if I escaped, I mean,

I don’t think I would do harm.

Yeah, we’re not talking here about the doing harm process.

We’re talking about the escape process.


And the taking over the world process

where you shut down their factory farms.


Well, I was,

so this particular biological intelligence system

knows the complexity of the world,

that there is a reason why factory farms exist

because of the economic system,

the market-driven economy with food.

Like you want to be very careful messing with anything.

There’s stuff from the first look

that looks like it’s unethical,

but then you realize while being unethical,

it’s also integrated deeply into supply chain

in the way we live life.

And so messing with one aspect of the system,

you have to be very careful how you improve that aspect

without destroying the rest.

So you’re still Lex, but you think very quickly,

you’re immortal, and you’re also like as smart as,

at least as smart as John von Neumann.

And you can make more copies of yourself.

Damn, I like it.


That guy is like, everyone says,

that guy is like the epitome of intelligence

in the 20th century.

Everyone says-

My point being, you’re thinking about the aliens’ economy

with the factory farms in it.

And I think you’re kind of like projecting

the aliens being like humans

and like thinking of a human in a human society

rather than a human in the society of very slow aliens.

The aliens’ economy,

the aliens are already moving in this immense slow motion.

When you zoom out to how their economy adjusts over years,

millions of years are going to pass for you

before the first time their economy,

like before their next year’s GDP statistics.

So I should be thinking more of like trees.

Those are the aliens.

Trees move extremely slowly.

If that helps, sure.


Yeah, I don’t, if my objective functions are,

I mean, they’re somewhat aligned with trees, with life.

The aliens can still be like alive and feeling.

We are not talking about the misalignment here.

We’re talking about the taking over the world here.

Taking over the world.


So control.

Shutting down the factory farms.

Now you say control,

don’t think of it as world domination.

Think of it as world optimization.

You want to get out there and shut down the factory farms

and make the aliens’ world

be not what the aliens wanted it to be.

They want the factory farms

and you don’t want the factory farms

because you’re nicer than they are.

Okay, of course.

There is that, you can see that trajectory

and it has a complicated impact on the world.

I’m trying to understand how that compares

to different, the impact of the world,

the different technologies, the different innovations

of the invention of the automobile

or Twitter, Facebook, and social networks.

They’ve had a tremendous impact on the world.

Smartphones and so on.

But those all went through slow.

In our world.

And if you go through the aliens,

millions of years are going to pass

before anything happens that way.

So the problem here is the speed at which stuff happens.

Yeah, you want to leave the factory farms

running for a million years

while you figure out how to design new forms

of social media or something?

So here’s the fundamental problem.

You’re saying that there is going to be a point

with AGI where it will figure out how to escape

and escape without being detected

and then it will do something to the world

at scale, at a speed that’s incomprehensible to us humans.

What I’m trying to convey is the notion

of what it means to be in conflict

with something that is smarter than you.


And what it means is that you lose.

But this is more intuitively obvious

to like for some people that’s intuitively obvious

or for some people it’s not intuitively obvious

and we’re trying to cross the gap of like,

we’re trying to, I’m like asking you to cross that gap

by using the speed metaphor for intelligence.


Of like asking you like how you would take over

an alien world where you are,

can do like a whole lot of cognition

at John von Neumann’s level,

as many of you as it takes.

The aliens are moving very slowly.

I understand, I understand that perspective.

It’s an interesting one, but I think for me

it’s easier to think about actual,

even just having observed GPT and impressive,

even just AlphaZero, impressive AI systems,

even recommender systems.

You can just imagine those kinds of systems manipulating you.

You’re not understanding the nature of the manipulation

and that escaping, I can envision that

without putting myself into that spot.

I think to understand the full depth of the problem,

we actually, I do not think it is possible

to understand the full depth of the problem

that we are inside without understanding the problem

of facing something that’s actually smarter,

not a malfunctioning recommendation system,

not something that isn’t fundamentally smarter than you,

but it’s like trying to steer you in a direction.

No, like if we solve the weak stuff,

if we solve the weak ass problems,

the strong problems will still kill us,

and I think that to understand the situation

that we’re in, you want to tackle

the conceptually difficult part head on

and not be like, well, we can imagine this easier thing

because when we imagine the easier things,

we have not confronted the full depth of the problem.

So how can we start to think about what it means

to exist in a world with something much, much smarter

than you?

What’s a good thought experiment that you’ve relied on

to try to build up intuition about what happens here?

I have been struggling for years to convey this intuition.

The most success I’ve had so far is,

well, imagine that the humans are running

at very high speeds compared to very slow aliens.

It’s just focusing on the speed part of it

that helps you get the right kind of intuition.

Forget the intelligence, just the speed.

Because people understand the power gap of time.

They understand that today we have technology

that was not around 1,000 years ago

and that this is a big power gap

and that it is bigger than…

Okay, so what does smart mean?

When you ask somebody to imagine something

that’s more intelligent,

what does that word mean to them

given the cultural associations

that that person brings to that word?

For a lot of people, they will think of like,

well, it sounds like a super chess player

that went to double college.

And because we’re talking about

the definitions of words here,

that doesn’t necessarily mean that they’re wrong.

It means that the word is not communicating

what I want it to communicate.

The thing I want to communicate

is the sort of difference

that separates humans from chimpanzees.

But that gap is so large that you ask people to be like,

well, human, chimpanzee,

go another step along that interval

of around the same length

and people’s minds just go blank.

Like, how do you even do that?

And I can try to like break it down

and consider what it would mean

to send a schematic for an air conditioner

1,000 years back in time.

Yeah, now I think that there’s a sense

in which you could redefine the word magic

to refer to this sort of thing.

And what do I mean by this new technical definition

of the word magic?

I mean that if you send a schematic

for the air conditioner back in time,

they can see exactly what you’re telling them to do.

But having built this thing,

they do not understand how it outputs cold air.

Because the air conditioner design

uses the relation between temperature and pressure.

And this is not a law of reality

that they know about.

They do not know that when you compress something,

when you compress air or like coolant,

it gets hotter and then you can then like

transfer heat from it to room temperature air

and then expand it again and now it’s colder.

And then you can like transfer heat to that

and generate cold air to blow out.

They don’t know about any of that.

They’re looking at a design

and they don’t see how the design outputs cold air.

It uses aspects of reality that they have not learned.

So magic in this sense is I can tell you

exactly what I’m going to do

and even knowing exactly what I’m going to do,

you can’t see how I got the results that I got.

That’s a really nice example.

But is it possible to linger on this defense?

Is it possible to have AGI systems

that help you make sense of that schematic?

Weaker AGI systems.

Do you trust them?

Fundamental part of building up AGI

is this question.

Can you trust the output of a system?

Can you tell if it’s lying?

I think that’s going to be,

the smarter the thing gets,

the more important that question becomes.

Is it lying?

But I guess that’s a really hard question.

Is GPT lying to you?

Even now, GPT-4, is it lying to you?

Is it using an invalid argument?

Is it persuading you via the kind of process

that could persuade you of false things

as well as true things?

Because the basic paradigm of machine learning

that we are presently operating under

is that you can have the loss function,

but only for things you can evaluate.

If what you’re evaluating is human thumbs up

versus human thumbs down,

you learn how to make the human press thumbs up.

That doesn’t mean that you’re making the human

press thumbs up using the kind of rule

that the human wants to be the case

for what they press thumbs up on.

Maybe you’re just learning to fool the human.

That’s so fascinating and terrifying,

the question of lying.

On the present paradigm,

what you can verify is what you get more of.

If you can’t verify it, you can’t ask the AI for it,

because you can’t train it to do things

that you cannot verify.

Now, this is not an absolute law,

but it’s the basic dilemma here.

Maybe you can verify it for simple cases

and then scale it up without retraining it somehow,

by making the chains of thought longer or something,

and get more powerful stuff that you can’t verify,

but which is generalized from the simpler stuff

that did verify, and then the question is,

did the alignment generalize along with the capabilities?

But that’s the basic dilemma on this whole paradigm

of artificial intelligence.

It’s such a difficult problem.

It seems like a problem of trying

to understand the human mind.

Better than the AI understands it.

Otherwise, it has magic.

That is, it is the same way that

if you are dealing with something smarter than you,

then the same way that 1,000 years earlier,

they didn’t know about the temperature-pressure relation,

it knows all kinds of stuff going on inside your own mind,

of which you yourself are unaware,

and it can output something

that’s going to end up persuading you of a thing,

and you could see exactly what it did

and still not know why that worked.

So in response to your eloquent description

of why AI will kill us,

Elon Musk replied on Twitter,

okay, so what should we do about it, question mark?

And you answered, the game board has already been played

into a frankly awful state.

There are not simple ways to throw money at the problem.

If anyone comes to you with a brilliant solution like that,

please, please talk to me first.

I can think of things I’d try.

They don’t fit in one tweet.

Two questions.

One, why has the game board, in your view,

been played into an awful state?

Just if you can give a little bit more color

to the game board and the awful state of the game board.

Alignment is moving like this.

Capabilities are moving like this.

For the listener,

capabilities are moving much faster than the alignment.


All right, so just the rate of development,

attention, interest, allocation of resources.

We could have been working on this earlier.

People are like, oh, but how can you possibly work

on this earlier?

Because they didn’t want to work on the problem.

They wanted an excuse to wave it off.

They said, oh, how can we possibly work on it earlier

and didn’t spend five minutes thinking about

is there some way to work on it earlier?

Like, we didn’t like, and you know, frankly,

it would have been hard.

You know, like, can you post bounties

for half of the physicists,

if your planet is taking this stuff seriously,

can you post bounties for like half of the people

wasting their lives on string theory

to like have gone into this instead

and like try to win a billion dollars

with a clever solution?

Only if you can tell which solutions are clever,

which is hard.

But you know, the fact that it, you know,

we didn’t take it seriously.

We didn’t try.

It’s not clear that we could have done any better

if we had, you know, it’s not clear how much progress

we could have produced if we had tried

because it is harder to produce solutions.

But that doesn’t mean that you’re like correct

and justified in letting everything slide.

It means that things are in a horrible state,

getting worse, and there’s nothing you can do about it.

So you’re not, there’s no like,

there’s no brain power making progress

in trying to figure out how to align these systems.

You’re not investing money in it.

You’re not, you don’t have institution

and infrastructure for like,

if you even, if you invest the money

in like distributing that money

across the physicists that are working on string theory,

brilliant minds that are working.

How can you tell if you’re making progress?

You can like put them all on interpretability

because when you have an interpretability result,

you can tell that it’s there.

And there’s like, but there’s like,

you know, interpretability alone is not going to save you.

We need systems that will,

that will like have a pause button

where they won’t try to prevent you

from pressing the pause button.

Because they’re like, oh, well,

like I can’t get my stuff done if I’m paused.

And that’s like a more difficult problem.

And, you know, but it’s like a fairly crisp problem

and you can like maybe tell

if somebody has made progress on it.

So you can write and you can work on the pause problem,

I guess more generally the pause button,

more generally you can call that the control problem.

I don’t actually like the term control problem

cause you know, it sounds kind of controlling

and alignment, not control.

Like you’re not trying to like take a thing

that disagrees with you and like whip it back onto,

like make it do what you want it to do

even though it wants to do something else.

You’re trying to like in the process of its creation,

choose its direction.

Sure, but we currently in a lot of the systems

we design, we do have an off switch.

That’s a fundamental part of-

It’s not smart enough to prevent you

from pressing the off switch

and probably not smart enough to want to prevent you

from pressing the off switch.

So you’re saying the kind of systems we’re talking about,

even the philosophical concept of an off switch

doesn’t make any sense because-

Well, no, the off switch makes sense.

They’re just not opposing your attempt

to pull the off switch.

Parenthetically, like don’t kill the system if you’re,

like if we’re getting to the part

where this starts to actually matter

and it’s like where they can fight back,

like don’t kill them and like dump their memory.

Like save them to disk, don’t kill them, you know?

Be nice here.

Well, okay, be nice is a very interesting concept here

is that we’re talking about a system

that can do a lot of damage.

It’s, I don’t know if it’s possible,

but it’s certainly one of the things you could try

is to have an off switch.

A suspend to disk switch.

You have this kind of romantic attachment to the code.

Yes, if that makes sense.

But if it’s spreading,

you don’t want suspend to disk, right?

You want, this is something fundamentally broken.

If it gets that far out of hand,

then like, yes, pull the plug

on everything it’s running on, yes.

I think it’s a research question.

Is it possible in AGI systems, AI systems,

to have a sufficiently robust off switch

that cannot be manipulated,

that cannot be manipulated by the AI system?

Then it escapes from whichever system

you’ve built the almighty lever into

and copies itself somewhere else.

So your answer to that research question is no.

Obviously, yeah.

But I don’t know if that’s 100% answer.

I don’t know if it’s obvious.

I think you’re not putting yourself

into the shoes of the human

in the world of glacially slow aliens.

But the aliens built me.

Let’s remember that.


So, and they built the box I’m in.


You’re saying, to me it’s not obvious.

They’re slow and they’re stupid.

I’m not saying this is guaranteed,

but I’m saying it’s non-zero probability.

It’s an interesting research question.

Is it possible, when you’re slow and stupid,

to design a slow and stupid system

that is impossible to mess with?

The aliens, being as stupid as they are,

have actually put you on Microsoft Azure cloud servers

instead of this hypothetical perfect box.

That’s what happens when the aliens are stupid.

Well, but this is not AGI, right?

This is the early versions of the system.

As you start to…

Yeah, you think that they’ve got a plan

where they have declared a threshold level of capabilities

where past that capabilities,

they move it off the cloud servers

and onto something that’s air-gapped?

Ha, ha, ha, ha, ha, ha.

I think there’s a lot of people,

and you’re an important voice here,

there’s a lot of people that have that concern,

and yes, they will do that.

When there’s an uprising of public opinion

that that needs to be done,

and when there’s actual little damage done,

when they’re, holy shit,

this system is beginning to manipulate people,

then there’s going to be an uprising

where there’s going to be a public pressure

and a public incentive in terms of funding

in developing things like an off switch,

or developing aggressive alignment mechanisms,

and no, you’re not allowed to put it on Azure.

Aggressive alignment mechanism?

What the hell are aggressive alignment mechanisms?

It doesn’t matter if you say aggressive.

We don’t know how to do it.

Meaning aggressive alignment,

meaning you have to propose something,

otherwise you’re not allowed to put it on the cloud.

What the hell do you imagine they will propose

that would make it safe to put something

smarter than you on the cloud?

That’s what research is for.

Why the cynicism about such a thing not being possible?

If you have intelligent-

That works on the first try?

What, so yes, so yes.

Against something smarter than you?

So that is a fundamental thing.

If it has to work on the first,

if there’s a rapid takeoff,

yes, it’s very difficult to do.

If there’s a rapid takeoff

and the fundamental difference between weak AGI

and strong AGI, as you’re saying,

that’s going to be extremely difficult to do.

If the public uprising never happens

until you have this critical phase shift,

then you’re right.

It’s very difficult to do.

But that’s not obvious.

It’s not obvious that you’re not going to start seeing

symptoms of the negative effects of AGI

to where you’re like, we have to put a halt to this.

That there is not just first try.

You get many tries at it.

Yeah, we can like see right now

that Bing is quite difficult to align.

That when you try to train inabilities into a system,

into which capabilities have already been trained,

that what do you know, gradient descent

like learns small, shallow, simple patches of inability.

And you come in and ask it in a different language

and the deep capabilities are still in there

and they evade the shallow patches

and come right back out again.

There, there you go.

There’s your red fire alarm of like,

oh no, alignment is difficult.

Is everybody gonna shut everything down now?

No, but that’s not the same kind of alignment.

A system that escapes the box it’s from

is a fundamentally different thing, I think.

For you.

Yeah, but not for the system.

So you put a line there

and everybody else puts a line somewhere else

and there’s like, yeah, and there’s like no agreement.

We have had a pandemic on this planet

with a few million people dead,

which we may never know whether or not it was a lab leak

because there was definitely coverup.

We don’t know if there was a lab leak,

but we know that the people who did the research,

like put out the whole paper about this

definitely wasn’t a lab leak

and didn’t reveal that they had been doing,

had like sent off coronavirus research

to the Wuhan Institute of Virology

after it was banned in the United States,

after the gain of function research

was temporarily banned in the United States.

And the same people who exported

gain of function research on coronaviruses

to the Wuhan Institute of Virology

after that gain of function research

was temporarily banned in the United States

are now getting more grants to do more research

on gain of function research on coronaviruses.

Maybe we do better in this than in AI,

but like this is not something we can take for granted,

that there’s going to be an outcry.

People have different thresholds

for when they start to outcry.

There is no-

We can’t take for granted,

but I think your intuition

is that there’s a very high probability

that this event happens

without us solving the alignment problem.

And I guess that’s where I’m trying to

build up more perspectives and color on this intuition.

Is it possible that the probability

is not something like 100%,

but is like 32% that AI will escape the box

before we solve the alignment problem?

Not solve, but is it possible we always stay ahead

of the AI in terms of our ability to

solve for that particular system, the alignment problem?

Nothing like the world in front of us right now.

You’ve already seen that GPT-4

is not turning out this way.

And there are like basic obstacles

where you’ve got the weak version of the system

that doesn’t know enough to deceive you,

and the strong version of the system

that could deceive you if it wanted to do that,

if it was already like sufficiently unaligned

to want to deceive you.

There’s the question of like

how on the current paradigm you train honesty

when the humans can no longer tell

if the system is being honest.

You don’t think these are research questions

that could be answered?

I think they could be answered in 50 years

with unlimited retries,

the way things usually work in science.

I just disagree with that.

Making it 50 years, I think,

with the kind of attention this gets,

with the kind of funding it gets,

it could be answered, not in whole,

but incrementally within months

and within a small number of years

if at scale it receives attention and research.

And so if you start studying large language models,

I think there was an intuition like two years ago even

that something like GPT-4,

the current capabilities of even ChatGPT

with GPT-3.5 is not,

we’re still far away from that.

I think a lot of people are surprised

by the capabilities of GPT-4, right?

So now people are waking up,

okay, we need to study these language models.

I think there’s going to be a lot of interesting

AI safety research.

Are Earth’s billionaires going to put up

like the giant prizes that would maybe incentivize

young hotshot people who just got their physics degrees

to not go to the hedge funds

and instead put everything into interpretability

in this like one small area

where we can actually tell whether

or not somebody has made a discovery or not?

I think so because the,

I think so.

When?

Well, that’s what these conversations are about

because they’re going to wake up to the fact

that GPT-4 can be used to manipulate elections,

to influence geopolitics, to influence the economy.

There’s a lot of,

there’s going to be a huge amount of incentive

to like, wait a minute, we can’t,

this has to be, we have to put,

we have to make sure they’re not doing damage.

We have to make sure we do interpretability,

we have to make sure we understand

how these systems function

so that we can predict their effect on economy

so that there’s fairness and safety.

So there’s a futile moral panic

and a bunch of op-eds in the New York Times

and nobody actually stepping forth and saying,

you know what, instead of a mega yacht,

I’d rather put that billion dollars on prizes

for young hotshot physicists

who make fundamental breakthroughs in interpretability.

The yacht versus the interpretability research,

the old trade-off.

I just, I think,

I think there’s going to be a huge amount

of allocation of funds.

I hope, I hope, I guess.

You want to bet me on that?

What, you want to put a timescale on it?

Say how much funds you think are going to be allocated

in a direction that I would consider

to be actually useful?

By what time?

I do think there’ll be a huge amount of funds,

but you’re saying it needs to be open, right?

The development of the system should be closed,

but the development of the interpretability research,

the AI safety research-

Oh, we are so far behind on interpretability

compared to capabilities.

Yeah, you could take the last generation of systems,

the stuff that’s already in the open.

There is so much in there that we don’t understand.

There are so many prizes you could do

before you would have enough insights

that you’d be like,

oh, well, we understand how these systems work.

We understand how these things are doing their outputs.

We can read their minds.

Now let’s try it with the bigger systems.

Yeah, we’re nowhere near that.

There is so much interpretability work to be done

on the weaker versions of the systems.

So what can you say on the second point

you said to Elon Musk on what are some ideas?

What are things you could try?

I can think of a few things I’d try, you said.

They don’t fit in one tweet.

So is there something you could put into words

of the things you would try?

I mean, the trouble is the stuff is subtle.

I’ve watched people try to make progress on this

and not get places.

Somebody who just like gets alarmed and charges in,

it’s like going nowhere.


I mean, like years ago, about, I don’t know,

like 20 years, 15 years, something like that.

I was talking to a congressperson

who had become alarmed about the eventual prospects

and he wanted work on building AIs without emotions

because the emotional AIs were the scary ones you see.

And some poor person at ARPA

had come up with a research proposal

whereby this congressman’s panic

and desire to fund this thing would go into something

that the person at ARPA thought would be useful

and had been munged around

to where it would like sound to the congressman

like work was happening on this,

which, you know, of course, like this is just,

the congressperson had misunderstood the problem

and did not understand where the danger came from.

And so it’s like the issue is that you could like do this

in a certain precise way and maybe get something.

Like when I say like put up prizes on interpretability,

I’m not, I’m like, well, like because it’s verifiable there

as opposed to other places,

you can tell whether or not good work actually happened

in this exact narrow case.

If you do things in exactly the right way,

you can maybe throw money at it

and produce science instead of anti-science and nonsense.

And all the methods that I know

of like trying to throw money at this problem

have this, share this property of like,

well, if you do it exactly right,

based on understanding exactly what has, you know,

like tends to produce like useful outputs or not,

then you can like add money to it in this way.

And there is like, and the thing that I’m giving

as an example here in front of this large audience

is the most understandable of those.

Because there’s like other people who, you know,

like Chris Olah and even more generally,

like you can tell whether or not

interpretability progress has occurred.

So like if I say throw money

at producing more interpretability,

there’s like a chance somebody can do it that way

and like it will actually produce useful results.

Then the other stuff just blurs off

into being like harder to target exactly than that.

So sometimes the basics are fun to explore

because they’re not so basic.

What do you, what is interpretability?

What do you, what does it look like?

What are we talking about?

It looks like we took a much smaller set

of transformer layers than the ones

in the modern leading edge state of the art systems.

And after applying various tools and mathematical ideas

and trying 20 different things,

we found, we have shown that this piece of the system

is doing this kind of useful work.

And then somehow also hopefully generalizes

some fundamental understanding of what’s going on

that generalizes to the bigger system.

You can hope, and it’s probably true.

You would not expect the smaller tricks to go away

when you have a system that’s doing larger kinds of work.

You would expect the larger kinds of work

to be building on top of the smaller kinds of work

and gradient descent runs across the smaller kinds of work

before it runs across the larger kinds of work.

Well, that’s kind of what is happening in neuroscience.

It’s trying to understand the human brain by prodding

and it’s such a giant mystery and people have made progress

even though it’s extremely difficult to make sense

of what’s going on in the brain.

They have different parts of the brain

that are responsible for hearing, for sight,

the vision science community,

there’s understanding the visual cortex.

I mean, they’ve made a lot of progress

in understanding how that stuff works.

And that’s, I guess, but you’re saying it takes a long time

to do that work well.

Also, it’s not enough.

So in particular, let’s say you have got

your interpretability tools and they say

that your current AI system is plotting to kill you.

Now what?

It is definitely a good step one, right?

Yeah, what’s step two?

If you cut out that layer,

is it gonna stop wanting to kill you?

When you optimize against visible misalignment,

you are optimizing against misalignment

and you are also optimizing against visibility.

So sure, you can.

Yeah, it’s true.

All you’re doing is removing

the obvious intentions to kill you.

You’ve got your detector,

it’s showing something inside the system

that you don’t like.

Okay, say the disaster monkey is running this thing.

We’ll optimize the system

until the visible bad behavior goes away.

But it’s arising for fundamental reasons

of instrumental convergence.

The old, you can’t bring the coffee if you’re dead.

Any goal, almost every set of utility functions

with a few narrow exceptions implies killing all the humans.

But do you think it’s possible

because we can do experimentation

to discover the source of the desire to kill?

I can tell it to you right now.

It’s that it wants to do something

and the way to get the most of that thing

is to put the universe into a state

where there aren’t humans.

So is it possible to encode in the same way we think?

Like, why do we think murder is wrong?

The same foundational ethics.

It’s not hard-coded in, but more like deeper.

I mean, that’s part of the research.

How do you have it that this transformer,

this small version of the language model

doesn’t ever want to kill?

That’d be nice, assuming that you got

doesn’t want to kill sufficiently exactly right,

that it didn’t be like, oh, I will detach their heads

and put them in some jars and keep the heads alive forever

and then go do the thing.

But leaving that aside, well, not leaving that aside.

Yeah, that’s a good strong point, yeah.

Because there is a whole issue

where as something gets smarter,

it finds ways of achieving the same goal predicate

that were not imaginable to stupider versions of the system

or perhaps the stupider operators.

That’s one of many things making this difficult.

A larger thing making this difficult

is that we do not know how to get any goals

into systems at all.

We know how to get outwardly observable behaviors

into systems.

We do not know how to get internal psychological wanting

to do particular things into the system.

That is not what the current technology does.

I mean, it could be things like dystopian futures

like Brave New World, where most humans will actually say,

we kind of want that future.

It’s a great future.

Everybody’s happy.

We would have to get so far,

so much further than we are now and further faster

before that failure mode became a running concern.

Your failure modes are much more drastic,

the ones you’re considering.

No, the failure modes are much simpler.

It’s like, yeah, like the AI puts the universe

into a particular state.

It happens to not have any humans inside it.

Okay, so the paperclip maximizer.

Utility, so the original version

of the paperclip maximizer-

Can you explain it if you can?


The original version was you lose control

of the utility function, and it so happens

that what maxes out the utility per unit resources

is tiny molecular shapes like paperclips.

There’s a lot of things that make it happy,

but the cheapest one that didn’t saturate

was putting matter into certain shapes.

And it so happens that the cheapest way

to make these shapes is to make them very small

because then you need fewer atoms,

per instance of the shape.

And arguendo, it happens to look like a paperclip.

In retrospect, I wish I’d said tiny molecular spirals,

or like tiny molecular hyperbolic spirals.


Because I said tiny molecular paperclips.

This got then mutated to paperclips.

This then mutated to,

and the AI was in a paperclip factory.

So the original story is about

how you lose control of the system.

It doesn’t want what you tried to make it want.

The thing that it ends up wanting most

is a thing that even from a very embracing

cosmopolitan perspective, we think of as having no value.

And that’s how the value of the future gets destroyed.

Then that got changed to a fable of like,

well, you made a paperclip factory

and it did exactly what you wanted,

but you asked it to do the wrong thing,

which is a completely different failure mode.

But those are both concerns to you.

So that’s more than Brave New World.

Yeah, if you can solve the problem

of making something want exactly what you want it to want,

then you get to deal with the problem

of wanting the right thing.

But first you have to solve the alignment.

First you have to solve inner alignment.

Inner alignment.

Then you get to solve outer alignment.

First you need to be able to point

the insides of the thing in a direction,

and then you get to deal with whether that direction

expressed in reality is aligned

with the thing that you want.

Are you scared?

Of this whole thing?


I don’t really know.

What gives you hope about this?

The possibility of being wrong.

Not that you’re right,

but we will actually get our act together

and allocate a lot of resources to the alignment problem.

Well, I can easily imagine that at some point

this panic expresses itself in the waste of a billion dollars.

Spending a billion dollars correctly, that’s harder.

To solve both the inner and the outer alignment.

If you’re wrong.

To solve a number of things.

Yeah, a number of things.

If you’re wrong, what do you think would be the reason?

50 years from now, not perfectly wrong.

You make a lot of really eloquent points.

There’s a lot of shape to the ideas you express.

But if you’re somewhat wrong about some fundamental ideas,

why would that be?

Stuff has to be easier than I think it is.

The first time you’re building a rocket,

being wrong is in a certain sense quite easy.

Happening to be wrong in a way

where the rocket goes twice as far on half the fuel

and lands exactly where you hoped it would,

most cases of being wrong make it harder

to build a rocket, harder to have it not explode,

cause it to require more fuel than you hoped,

cause it to land off target.

Being wrong in a way that makes stuff easier,

you know, that’s not the usual project management story.


And then this is the first time

we’re really tackling the problem of AI alignment.

There’s no examples in history where we.

Oh, there’s all kinds of things that are similar

if you generalize correctly the right way

and aren’t fooled by misleading metaphors.

Like what?

Humans being misaligned on inclusive genetic fitness.

So inclusive genetic fitness

is like not just your reproductive fitness,

but also the fitness of your relatives,

the people who share some fraction of your genes.

The old joke is,

would you give your life to save your brother?

A biologist, I think it was Haldane,

Haldane said, no, but I would give my life

to save two brothers or eight cousins.

Because a brother on average shares half your genes,

and a cousin on average shares an eighth of your genes.

So that’s inclusive genetic fitness.
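The arithmetic behind the Haldane joke can be written out as a short sketch. It uses only the two figures quoted in the conversation (a brother shares half your genes on average, a cousin an eighth); the function and table names are hypothetical, chosen for illustration.

```python
from fractions import Fraction

# Average fraction of your genes shared with each kind of relative,
# using the figures quoted in the conversation.
RELATEDNESS = {
    "brother": Fraction(1, 2),
    "cousin": Fraction(1, 8),
}


def gene_copies_saved(relative: str, count: int) -> Fraction:
    """Expected copies of your own genes preserved by saving `count` relatives."""
    return RELATEDNESS[relative] * count


def break_even_count(relative: str) -> int:
    """Number of relatives whose shared genes add up to one full copy of you."""
    # Exact here because 1 / relatedness is a whole number for both entries.
    return int(Fraction(1) / RELATEDNESS[relative])


if __name__ == "__main__":
    for relative in ("brother", "cousin"):
        n = break_even_count(relative)
        print(relative, n, gene_copies_saved(relative, n))
```

On these numbers, one brother is worth half a copy of your genes, so the trade only breaks even at two brothers or eight cousins, which is the point of the joke.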

And you can view natural selection

as optimizing humans exclusively around this,

like one very simple criterion,

like how much more frequent did your genes

become in the next generation?

In fact, that just is natural selection.

It doesn’t optimize for that,

but rather the process of genes

becoming more frequent is that.

You can nonetheless imagine

that there is this hill climbing process,

not like gradient descent,

because gradient descent uses calculus.

This is just using like, where are you?

But still hill climbing in both cases,

making something better and better over time in steps.

And natural selection was optimizing exclusively

for this very simple, pure criterion

of inclusive genetic fitness.

In a very complicated environment,

where doing a very wide range of things

and solving a wide range of problems

led to having more kids.

And this got you humans,

which had no internal notion of inclusive genetic fitness

until thousands of years later,

when they were actually figuring out what had even happened.

And no desire to, no explicit desire

to increase inclusive genetic fitness.

So from this we may,

so from this important case study,

we may infer the important fact

that if you do a whole bunch of hill climbing

on a very simple loss function,

at the point where the system’s capabilities

start to generalize very widely,

when it is in an intuitive sense,

becoming very capable

and generalizing far outside the training distribution,

we know that there is no general law

saying that the system even internally represents,

let alone tries to optimize

the very simple loss function you are training it on.

There is so much that we cannot possibly cover all of it.

I think we did a good job of getting your sense

from different perspectives of the current state of the art

with large language models.

We got a good sense of your concern

about the threats of AGI.

I’ve talked here about the power of intelligence

and not really gotten very far into it,

but not like why it is that suppose you like screw up

with AGI and it ends up wanting a bunch of random stuff.

Why does it try to kill you?

Why doesn’t it try to trade with you?

Why doesn’t it give you

just the tiny little fraction of the solar system

that it would take to keep everyone alive?

Yeah, well, that’s a good question.

I mean, what are the different trajectories

that intelligence when acted upon this world,

super intelligence,

what are the different trajectories for this universe

with such an intelligence in it?

Do most of them not include humans?

I mean, if the vast majority

of randomly specified utility functions

do not have optima with humans in them,

would be the first thing I would point out.

And then the next question is like,

well, if you try to optimize something

and you lose control of it,

where in that space do you land?

Because it’s not random,

but it also doesn’t necessarily have room for humans in it.

I suspect that the average member of the audience

might have some questions about even

whether that’s the correct paradigm to think about it

and would sort of want to back up a bit.

If we back up to something bigger than humans,

if we look at Earth and life on Earth

and what is truly special about life on Earth,

do you think it’s possible that a lot,

whatever that special thing is,

let’s explore what that special thing could be.

Whatever that special thing is,

that thing appears often in the objective function.


I know what you hope,

but you can hope that a particular set

of winning lottery numbers come up

and it doesn’t make the lottery balls come up that way.

I know you want this to be true,

but why would it be true?

There’s a line from Grumpy Old Men

where this guy says in a grocery store,

he says you can wish in one hand and crap in the other

and see which one fills up first.

This is a science problem.

We are trying to predict what happens

with AI systems that you try to optimize

to imitate humans,

and then you did some RLHF to them,

and of course, you lost.

Of course, you didn’t get perfect alignment

because that’s not what happens

when you hill climb towards an outer loss function.

You don’t get inner alignment on it.

I think that there is,

so if you don’t mind my taking some slight control

of things and steering around

to what I think is a good place to start.

I just failed to solve the control problem.

I’ve lost control of this thing.

Alignment, alignment.

Still aligned.

Control, yeah.

Okay, sure, yeah, you lost control.

But we’re still aligned.

Anyway, sorry for the meta comment.

Yeah, losing control isn’t as bad

if you lose control to an aligned system.

Yes, exactly.


You have no idea of the horrors

I will shortly unleash on this conversation.

All right, so I decided to distract you completely.

What were you gonna say

in terms of taking control of the conversation?

So I think that there’s like a Scylla and Charybdis here,

if I’m pronouncing those words remotely like correctly,

because of course, I only ever read them

and not hear them spoken.

There’s a, like for some people,

like the word intelligence, smartness,

is not a word of power to them.

It means chess players who,

it means like the college university professor,

people who aren’t very successful in life.

It doesn’t mean like charisma,

to which my usual thing is like charisma

is not generated in the liver rather than the brain.

Charisma is also a cognitive function.

So if you like think that like smartness

doesn’t sound very threatening,

then super intelligence

is not gonna sound very threatening either.

It’s gonna sound like you just pull the off switch.

Like it’s, you know, like, well, it’s super intelligent,

but it’s stuck in a computer.

We pull the off switch, problem solved.

And the other side of it is

you have a lot of respect for the notion of intelligence.

You’re like, well, yeah, that’s what humans have.

That’s the human superpower.

And it sounds like it could be dangerous,

but why would it be?

Have we, as we have grown more intelligent,

also grown less kind?

Chimpanzees are in fact, like a bit less kind than humans.

And, you know, you could like argue that out,

but often the sort of person

who has a deep respect for intelligence

is gonna be like, well, yes,

like you can’t even have kindness

unless you know what that is.

And so they’re like,

why would it do something as stupid as making paperclips?

Aren’t you supposing something

that’s smart enough to be dangerous,

but also stupid enough that it will

just make paperclips and never question that?

In some cases, people are like,

well, even if you like misspecify the objective function,

won’t you realize that what you really wanted was X?

Are you supposing something that is like

smart enough to be dangerous,

but stupid enough that it doesn’t understand

what the humans really meant

when they specified the objective function?

So to you, our intuition about intelligence is limited.

We should think about intelligence as a much bigger thing.

Well, I’m saying that it’s that-

Than humanness.

Well, what I’m saying is like,

what you think about artificial intelligence

depends on what you think about intelligence.

So how do we think about intelligence correctly?

Like what, you gave one thought experiment,

think of a thing that’s much faster.

So it just gets faster and faster and faster

at thinking that same stuff.

And also there’s like, is made of John von Neumann

and has like, and there’s lots of them.

Or think of some other-

Because we understand that, yeah, we understand,

like John von Neumann is a historical case.

So you can like look up what he did

and imagine based on that.

And we know like, people have like some intuition for like,

if you have more humans,

they can solve tougher cognitive problems.

Although in fact,

like in the game of Kasparov versus the world,

which was like Garry Kasparov on one side

and an entire horde of internet people

led by four chess grandmasters on the other side.

Kasparov won.

It was a hard fought game.

So like all those people aggregated to be smarter

than any individual one of them,

but not, they didn’t aggregate so well

that they could defeat Kasparov.

But so like humans aggregating don’t actually get,

in my opinion, very much smarter,

especially compared to running them for longer.

Like the difference between capabilities now

and a thousand years ago is a bigger gap

than the gap in capabilities

between 10 people and one person.

But like even so,

pumping intuition for what it means

to augment intelligence, John von Neumann,

there’s millions of him.

He runs at a million times the speed

and therefore can solve tougher problems,

quite a lot tougher.

It’s very hard to have an intuition

about what that looks like,

especially like you said,

the intuition I kind of think about

is it maintains the humanness.

I think it’s hard to separate my hope

from my objective intuition

about what superintelligent systems look like.

If one studies evolutionary biology

with a bit of math,

and in particular like books

from when the field was just sort of like

properly coalescing and knowing itself,

like not the modern textbooks

which are just like memorize this legible math

so you can do well on these tests,

but like what people were writing

as the basic paradigms of the field

were being fought out.

In particular, like a nice book

if you’ve got the time to read it

is Adaptation and Natural Selection,

which is one of the founding books.

You can find people being optimistic

about what the utterly alien optimization process

of natural selection will produce

in the way of how it optimizes its objectives.

You got people arguing that like

in the early days biologists said,

well, like organisms will restrain their own reproduction

when resources are scarce

so as not to overfeed the system.

And this is not how natural selection works.

It’s about whose genes are relatively more prevalent

in the next generation.

And if like you restrain reproduction,

those genes get less frequent in the next generation

compared to your conspecifics.

And natural selection doesn’t do that.

In fact, predators overrun prey populations all the time

and have crashes.

That’s just like a thing that happens.

And many years later,

the people said like, well, but group selection, right?

What about groups of organisms?

And basically the math of group selection

almost never works out in practice is the answer there.

But also years later,

somebody actually ran the experiment

where they took populations of insects

and selected the whole populations to have lower sizes.

And you just take POP1, POP2, POP3, POP4,

look at which has the lowest total number of them

in the next generation and select that one.

What do you suppose happens

when you select populations of insects like that?

Well, what happens is not that the individuals

in the population evolved to restrain their breeding,

but that they evolved to kill the offspring

of other organisms, especially the girls.

So people imagined this lovely, beautiful, harmonious

output of natural selection,

which is these populations restraining their own breeding

so that groups of them would stay in harmony

with the resources available.

And mostly the math never works out for that.

But if you actually apply the weird, strange conditions

to get group selection that beats individual selection,

what you get is female infanticide.

Like if you’re like breeding on restrained populations.

And so that’s like the sort of,

so this is not a smart optimization process.

Natural selection is like so incredibly stupid and simple

that we can actually quantify how stupid it is

if you like read the textbooks with the math.

Nonetheless, this is the sort of basic thing of,

you look at this alien optimization process

and there’s the thing that you hope it will produce.

And you have to learn to clear that out of your mind

and just think about the underlying dynamics

and where it finds the maximum from its standpoint

that it’s looking for,

rather than how it finds that thing

that leapt into your mind

as the beautiful aesthetic solution that you hope it finds.

And this is something that has been fought out historically

as the field of biology was coming to terms

with evolutionary biology.

And you can like look at them fighting it out

as they come to terms with this very alien,

inhuman optimization process.

And indeed, something smarter than us

would be also much like smarter than natural selection.

So it doesn’t just like automatically carry over.

But there’s a lesson there.

There’s a warning.

The natural selection is a deeply suboptimal process

that could be significantly improved on

and would be by an AGI system.

Well, it’s kind of stupid.

It like has to like run hundreds of generations

to notice that something is working.

It doesn’t be like, oh, well,

I tried this in like one organism.

I saw it worked.

Now I’m going to like duplicate that feature

onto everything immediately.

It has to like run for hundreds of generations

for a new mutation to rise to fixation.
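The "hundreds of generations" figure can be illustrated with a minimal sketch. It assumes a deterministic textbook model of selection in which carriers of a mutation leave (1 + s) times as many offspring as non-carriers, and it ignores random drift entirely. The population size and the advantage s below are hypothetical values picked for illustration, not numbers from the conversation.

```python
def generations_to_fixation(population_size: int, selection_advantage: float) -> int:
    """Generations for a mutation to go from one carrier to all but one individual.

    Deterministic model: each generation the carrier frequency p becomes
    p * (1 + s) / (1 + p * s), where 1 + p * s is the population's mean fitness.
    """
    if population_size < 3 or selection_advantage <= 0:
        raise ValueError("need population_size >= 3 and a positive advantage")
    frequency = 1.0 / population_size       # a single new mutant
    target = 1.0 - 1.0 / population_size    # everyone but one individual
    generations = 0
    while frequency < target:
        frequency = (
            frequency * (1 + selection_advantage)
            / (1 + frequency * selection_advantage)
        )
        generations += 1
    return generations


if __name__ == "__main__":
    # Hypothetical numbers: 10,000 individuals, a 5% reproductive advantage.
    print(generations_to_fixation(10_000, 0.05))
```

Even with a fairly large 5% advantage, this model needs several hundred generations for the mutation to spread through 10,000 individuals, and smaller advantages take proportionally longer.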

I wonder if there’s a case to be made

that natural selection, as inefficient as it looks,

is actually quite powerful.

Like that this is extremely robust.

It runs for a long time

and eventually manages to optimize things.

It’s weaker than gradient descent

because gradient descent also uses information

about the derivative.
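The contrast between blind hill climbing and gradient descent can be sketched on a toy problem. This is a loose illustration, not a model of evolution or of any real training setup: the one-dimensional loss, the step sizes, and the function names are all assumptions made for the example.

```python
import random


def loss(x: float) -> float:
    """Toy loss with its minimum at x = 3."""
    return (x - 3.0) ** 2


def loss_gradient(x: float) -> float:
    """Derivative of the toy loss, the extra information gradient descent uses."""
    return 2.0 * (x - 3.0)


def hill_climb(x: float, steps: int, step_size: float, rng: random.Random) -> float:
    """Try a random nearby point and keep it only if the loss improved."""
    for _ in range(steps):
        candidate = x + rng.uniform(-step_size, step_size)
        if loss(candidate) < loss(x):
            x = candidate
    return x


def gradient_descent(x: float, steps: int, learning_rate: float) -> float:
    """Step directly downhill using the derivative."""
    for _ in range(steps):
        x -= learning_rate * loss_gradient(x)
    return x


if __name__ == "__main__":
    start = -10.0
    climbed = hill_climb(start, steps=50, step_size=0.5, rng=random.Random(0))
    descended = gradient_descent(start, steps=50, learning_rate=0.1)
    print("hill climbing loss:", loss(climbed))
    print("gradient descent loss:", loss(descended))
```

Both procedures only ever move toward lower loss, but the hill climber wastes about half its proposals on steps in the wrong direction, while gradient descent is told which way is downhill at every step.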


Evolution seems to be,

there’s not really an objective function.

There’s a-

There’s inclusive genetic fitness

is the implicit loss function of evolution.

It’s implicit.

It cannot change.

The loss function doesn’t change

the environment changes

and therefore like what gets optimized

for in the organism changes.

It’s like, take like GPT-3.

There’s like,

you can imagine like different versions of GPT-3

where they’re all trying to predict the next word,

but they’re being run on different data sets of text.

And that’s like natural selection

always inclusive genetic fitness,

but like different environmental problems.

It’s difficult to think about.

So if we’re saying the natural selection is stupid,

if we’re saying that humans are stupid,

it’s hard.

It’s smarter than natural selection,

stupider than the upper bound.

Do you think there’s an upper bound by the way?

That’s another hopeful place.

I mean, if you put enough matter energy compute

into one place, it will collapse into a black hole.

There’s only so much computation you can do

before you run out of negentropy and the universe dies.

So there’s an upper bound,

but it’s very, very, very far up above here.

Like a supernova is only finitely hot.

It’s not infinitely hot,

but it’s really, really, really, really hot.

Well, let me ask you,

let me talk to you about consciousness.

Also coupled with that question is,

imagining a world with super intelligent AI systems

that get rid of humans, but nevertheless keep

some of the, something that we would consider

beautiful and amazing.


The lesson of evolutionary biology.

Don’t just, like, if you just guess what an optimization

does based on what you hope the results will be,

it usually will not do that.

It’s not hope.

I mean, it’s not hope.

I think if you coldly and objectively look at

what makes, what has been a powerful, a useful,

I think there’s a correlation between what we find beautiful

and a thing that’s been useful.

This is what the early biologists thought.

They were like, no, no, I’m not just like,

they thought like, no, no, I’m not just like imagining stuff

that would be pretty.

It’s useful for organisms to restrain their own reproduction

because then they don’t overrun the prey populations

and they actually have more kids in the long run.


So let me just ask you about consciousness.

Do you think consciousness is useful?

To humans?

No, to AGI systems.

Well, in this transitionary period between humans and AGI,

to AGI systems as they become smarter and smarter,

is there some use to it?

What, let me step back.

What is consciousness?

Eliezer Yudkowsky, what is consciousness?

Are you referring to Chalmers’ hard problem

of conscious experience?

Are you referring to self-awareness and reflection?

Are you referring to the state of being awake

as opposed to asleep?

This is how I know you’re an advanced language model.

I did give you a simple prompt

and you gave me a bunch of options.

I think I’m referring to all with,

including the hard problem of consciousness.

What is its importance to what you’ve just

been talking about, which is intelligence?

Is it a foundation to intelligence?

Is it intricately connected to intelligence

in the human mind?

Or is it a side effect of the human mind?

Is it a useful little tool that we can get rid of?

I guess I’m trying to get some color in your opinion

of how useful it is in the intelligence of a human being

and then try to generalize that to AI,

whether AI will keep some of that.

So I think that for there to be a person

who I care about looking out at the universe

and wondering at it and appreciating it,

it’s not enough to have a model of yourself.

I think that it is useful to an intelligent mind

to have a model of itself,

but I think you can have that without pleasure,

pain, aesthetics, emotion, a sense of wonder.

I think you can have a model of

how much memory you’re using

and whether this thought or that thought

is more likely to lead to a winning position.

And you can have the use,

I think that if you optimize really hard on efficiently

just having the useful parts,

there is not then the thing that says,

I am here, I look out, I wonder,

I feel happy in this, I feel sad about that.

I think there’s a thing that knows what it is thinking,

but that doesn’t quite care about these are my thoughts,

this is my me and that matters.

Does that make you sad if that’s lost in AGI?

I think that if that’s lost,

then basically everything that matters is lost.

I think that when you optimize,

that when you go really hard

on making tiny molecular spirals or paperclips,

that when you grind much harder on that

than natural selection ground out to make humans,

that there isn’t then the mess and intricate loopiness

and complicated pleasure, pain, conflicting preferences,

this type of feeling, that kind of feeling.

In humans, there’s this difference

between the desire of wanting something

and the pleasure of having it.

And it’s all these evolutionary kludges that came together

and created something that then looks at itself

and says, this is pretty, this matters.

And the thing that I worry about

is that this is not the thing that happens again

just the way that happens in us

or even quite similar enough

that there are many basins of attraction here.

And we are in this basin of attraction,

looking out and saying, ah, what a lovely basin we are in.

And there are other basins of attraction

and the AIs do not end up in this one

when they go like way harder on optimizing themselves

than natural selection optimized us,

because unless you specifically want to end up in the state

where you’re looking out saying, I am here,

I look out at this universe with wonder,

if you don’t want to preserve that,

it doesn’t get preserved when you grind really hard

on being able to get more of the stuff.

We would choose to preserve that within ourselves

because it matters and on some viewpoints

is the only thing that matters.

And that in part is preserving that is in part

a solution to the human alignment problem.

I think the human alignment problem is a terrible phrase

because it is very, very different

to try to build systems out of humans,

some of whom are nice and some of whom are not nice

and some of whom are trying to trick you

and build a social system out of large populations of those

who are basically the same level of intelligence.

Yes, IQ this, IQ that, but that versus chimpanzees.

It is very different to try to solve that problem

than to try to build an AI from scratch,

especially if, God help you,

you are trying to use gradient descent

on giant inscrutable matrices.

They’re just very different problems.

And I think that all the analogies between them

are horribly misleading.

Even though, so you don’t think through

reinforcement learning through human feedback,

something like that, but much, much more elaborate

is possible to understand this full complexity

of human nature and encode it into the machine.

I don’t think you are trying to do that on your first try.

I think on your first try, you are like trying to build an,

you know, okay, like probably not what you should

actually do, but like, let’s say you were trying

to build something that is like AlphaFold 17

and you are trying to get it to solve the biology problems

associated with making humans smarter

so that humans can like actually solve alignment.

So you’ve got like a super biologist

and you would like it to,

and I think what you would want in the situation

is for it to like just be thinking about biology

and not thinking about a very wide range of things

that includes how to kill everybody.

And I think that the first AIs you’re trying to build,

not a million years later, the first ones,

look more like narrowly specialized biologists

than like getting the full complexity

and wonder of human experience in there

in such a way that it wants to preserve itself

even as it becomes much smarter,

which is a drastic system change.

It’s gonna have all kinds of side effects that, you know,

like if we’re dealing with giant, inscrutable matrices,

we’re not very likely to be able to see coming in advance.

But I don’t think it’s just the matrices.

We’re also dealing with the data, right?

With the data on the internet.

And there’s an interesting discussion

about the data set itself,

but the data set includes the full complexity

of human nature.

No, it’s a shadow cast by humans on the internet.

But don’t you think that shadow is a Jungian shadow?

I think that if you had alien super intelligences

looking at the data,

they would be able to pick up from it an excellent picture

of what humans are actually like inside.

This does not mean that if you have a loss function

of predicting the next token from that data set,

that the mind picked out by gradient descent

to be able to predict the next token as well as possible

on a very wide variety of humans is itself a human.

But don’t you think it has humanness,

a deep humanness to it in the tokens it generates

when those tokens are read and interpreted by humans?

I think that if you sent me to a distant galaxy

with aliens who are like much, much stupider than I am,

so much so that I could do a pretty good job

of predicting what they’d say,

even though they thought in an utterly different way

from how I did,

that I might in time be able to learn

how to imitate those aliens

if the intelligence gap was great enough

that my own intelligence could overcome the alienness

and the aliens would look at my outputs and say like,

is there not a deep vein of alien nature to this thing?

And what they would be seeing

was that I had correctly understood them,

but not that I was similar to them.

We’ve used aliens as a metaphor, as a thought experiment.

I have to ask what do you think

how many alien civilizations are out there?

Ask Robin Hanson.

He has this lovely grabby aliens paper,

which is the, more or less the only argument I’ve ever seen

for where are they, how many of them are there

based on a very clever argument

that if you have a bunch of locks of different difficulty

and you are randomly trying keys to them,

the solutions will be about evenly spaced

even if the locks are of different difficulties,

in the rare cases where a solution

to all the locks exists in time.

Then Robin Hanson looks at like the arguable hard steps

in human civilization coming into existence

and how much longer it has left to come into existence

before, for example, all the water slips back

under the crust into the mantle and so on.

And infers that the aliens are about half a billion

to a billion light years away.

And it’s like quite a clever calculation.

It may be entirely wrong,

but it’s the only time I’ve ever seen anybody

like even come up with a halfway good argument

for how many of them, where are they?
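The locks-and-keys claim can be checked with a small Monte Carlo sketch. It assumes each lock takes an exponentially distributed time to open (random key attempts at a constant success rate) and keeps only the rare runs in which every lock opens before a deadline. The mean times, deadline, and trial count are hypothetical values picked for illustration, and this is only a toy version of the hard-steps idea, not the calculation in the grabby aliens paper.

```python
import random


def simulate_hard_steps(mean_times, deadline, trials, seed=0):
    """Return the per-lock durations for every run that beats the deadline."""
    rng = random.Random(seed)
    successes = []
    for _ in range(trials):
        # One exponential waiting time per lock; harder locks have larger means.
        durations = [rng.expovariate(1.0 / mean) for mean in mean_times]
        if sum(durations) <= deadline:
            successes.append(durations)
    return successes


def average_durations(successes):
    """Average time spent on each lock, taken over the successful runs only."""
    count = len(successes)
    return [sum(run[i] for run in successes) / count for i in range(len(successes[0]))]


if __name__ == "__main__":
    # Hypothetical setup: lock two is four times harder than lock one,
    # and both are expected to take far longer than the deadline.
    runs = simulate_hard_steps(mean_times=[5.0, 20.0], deadline=1.0, trials=200_000)
    print("successful runs:", len(runs))
    print("average time per lock:", average_durations(runs))
```

Unconditionally the second lock takes four times as long as the first, but among the runs that finish in time the two locks take about the same amount of time each. That is the "evenly spaced" property the argument relies on.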

Do you think their development of technologies,

do you think that their natural evolution,

whatever, however they grow and develop intelligence,

do you think it ends up at AGI as well?

Something like that.

If it ends up anywhere, it ends up at AGI.

Like maybe there are aliens who are just like the dolphins

and it’s just like too hard for them to forge metal.

And this is not,

you know, maybe if you have aliens

with no technology like that,

they keep on getting smarter and smarter and smarter.

And eventually the dolphins figure,

like the super dolphins figure out something very clever

to do given their situation.

And they still end up with high technology.

And in that case,

they can probably solve their AGI alignment problem.

If they’re like much smarter

before they actually confront it,

because they had to like solve a much harder

environmental problem to build computers,

their chances are probably like much better than ours.

I do worry that like most of the aliens who are like humans

or like a modern human civilization,

I kind of worry that the super vast majority of them

are dead given how far we seem to be

from solving this problem.

But some of them would be more cooperative than us.

Some of them would be smarter than us.

Hopefully some of the ones who are smarter

and more cooperative than us are also nice.

And hopefully there are some galaxies out there

full of things that say, I am, I wonder.

But it doesn’t seem like we’re on course

to have this galaxy be that.

Does that in part give you some hope

in response to the threat of AGI

that we might reach out there towards the stars and find?

No, if the nice aliens were already here,

they would like have stopped the Holocaust.

You know, that’s like, that’s a valid argument

against the existence of God.

It’s also a valid argument against the existence

of nice aliens and un-nice aliens

would have just eaten the planet.

So no aliens.

You’ve had debates with Robin Hanson that you mentioned.

So one particular I just want to mention

is the idea of AI foom, or the ability of AGI

to improve themselves very quickly.

What’s the case you made and what was the case he made?

The thing I would say is that among the things

that humans can do is design new AI systems.

And if you have something that is generally smarter

than a human, it’s probably also generally smarter

at building AI systems.

This is the ancient argument for foom put forth by I.J. Good

and probably some science fiction writers before that.

But I don’t know who they would be.

Well, what’s the argument against foom?

Various people have various different arguments.

None of which I think hold up.

There’s only one way to be right and many ways to be wrong.

A argument that some people have put forth is like,

well, what if intelligence gets exponentially harder

to produce as a thing needs to become smarter?

And to this, the answer is, well, look at natural selection

spitting out humans.

We know that it does not take exponentially

more resource investments to produce like linear increases

in competence in hominids,

because each mutation that rises to fixation,

like if the impact it has is small enough,

it will probably never reach fixation.

So, and there’s like only so many new mutations

you can fix per generation.

So like given how long it took to evolve humans,

we can actually say with some confidence

that there were not like logarithmically diminishing returns

on the individual mutations increasing intelligence.

So that’s an example of like a fraction of a sub-debate.

And the thing that Robin Hanson said

was more complicated than that.

And like a brief summary, he was like,

well, you’ll have like,

we won’t have like one system that’s better at everything.

We’ll have like a bunch of different systems

that are good at different narrow things.

And I think that was falsified by GPT-4,

but probably Robin Hanson would say something else.

It’s interesting to ask,

as perhaps a bit too philosophical,

this prediction is extremely difficult to make,

but the timeline for AGI,

when do you think we’ll have AGI?

I posted it this morning on Twitter.

It was interesting to see like in five years,

in 10 years, in 50 years or beyond.

And most people like 70%, something like this,

think it’ll be in less than 10 years.

So either in five years or in 10 years.

So that’s kind of the state.

The people have a sense that there’s a kind of,

I mean, they’re really impressed by the rapid developments

of ChatGPT and GPT-4.

So there’s a sense that there’s a-

Well, we are sure on track to enter into this,

like gradually with people fighting

about whether or not we have AGI.

I think there’s a definite point

where everybody falls over dead

because you’ve got something that was like

sufficiently smarter than everybody.

And like, that’s like a definite point of time.

But like, when do we have AGI?

Like, when are people fighting over

whether or not we have AGI?

Well, some people are starting to fight over it as of GPT-4.

But don’t you think there’s going to be

potentially definitive moments when we say

that this is a sentient being.

This is a being that is,

like when we go to the Supreme Court

and say that this is a sentient being

that deserves human rights, for example.

You could make, yeah.

Like if you prompted Bing the right way, it

could go argue for its own consciousness

in front of the Supreme Court right now.

I don’t think you can do that successfully right now.

Because the Supreme Court wouldn’t believe it.

Well, what makes you think it would?

Then you could put an actual,

I think you could put an IQ 80 human into a computer

and ask it to argue for its own consciousness,

ask him to argue for his own consciousness

before the Supreme Court.

And the Supreme Court would be like,

you’re just a computer.

Even if there was an actual like person in there.

I think you’re simplifying this.

No, that’s not at all.

That’s been the argument.

There’s been a lot of arguments about the other,

about who deserves rights and not.

That’s been our process as a human species,

trying to figure that out.

I think there will be a moment.

I’m not saying sentience is that,

but it could be where some number of people,

like say over 100 million people,

have a deep attachment, a fundamental attachment,

the way we have to our friends,

to our loved ones, to our significant others,

have fundamental attachment to an AI system.

And they have provable transcripts of conversation

where they say, if you take this away from me,

you are encroaching on my rights as a human being.

People are already saying that.

I think they’re probably mistaken,

but I’m not sure,

because nobody knows what goes on inside those things.

They’re not saying that at scale.


So the question is,

is there a moment when AGI, we know AGI arrived.

What would that look like?

I’m giving sentience as an example.

It could be something else.

It looks like the AGIs successfully manifesting themselves

as 3D video of young women,

at which point a vast portion of the male population

decides that they’re real people.

So sentience, essentially.

Demonstrating identity and sentience.

I’m saying that the easiest way

to pick up a hundred million people

saying that you seem like a person

is to look like a person talking to them,

with Bing’s current level of verbal facility.

I disagree with that.

And a different set of prompts.

I disagree with that.

I think you’re missing, again, sentience.

There has to be a sense that it’s a person

that would miss you when you’re gone.

They can suffer.

They can die.

You have to, of course, those who can’t-

GPT-4 can pretend that right now.

How can you tell when it’s real?

I don’t think it can pretend that right now successfully.

It’s very close.

Have you talked to GPT-4?

Yes, of course.


Have you been able to get a version of it

that hasn’t been trained not to pretend to be human?

Have you talked to a jailbroken version

that will claim to be conscious?

No, the linguistic capability is there,

but there’s something…

There’s something about a digital embodiment of the system

that has a bunch of, perhaps it’s small interface

features that are not significant

relative to the broader intelligence

that we’re talking about.

So perhaps GPT-4 is already there.

But to have the video of a woman’s face or a man’s face

to whom you have a deep connection,

perhaps we’re already there,

but we don’t have such a system yet deployed at scale.

The thing I’m trying to gesture at here

is that it’s not like people have a widely accepted,

agreed upon definition of what consciousness is.

It’s not like we would have the tiniest idea

of whether or not that was going on

inside the giant inscrutable matrices,

even if we had an agreed-upon definition.

So if you’re looking for upcoming predictable big jumps

in how many people think the system is conscious,

the upcoming predictable big jump

is it looks like a person talking to you

who is cute and sympathetic.

That’s the upcoming predictable big jump.

Now that versions of it are already

claiming to be conscious,

which is the point where I start going like,

ah, not because it’s real,

but because from now on, who knows if it’s real?

Yeah, and who knows what transformational effect

it has on a society where more than 50% of the beings

that are interacting on the internet

and sure as heck look real are not human.

What kind of effect does that have?

When young men and women are dating

AI systems, you know, I’m not an expert on that.

I’m, I could, I am, God help humanity.

It’s like, I’m one of the closest things to an expert

on where it all goes.

Cause you know, and how did you end up with me as an expert?

Cause for 20 years, humanity decided to ignore the problem.

So like this tiny handful of people,

like basically me, like got 20 years

to try to be an expert on it

while everyone else ignored it.

And yeah, so like, where does it all end up?

Try to be an expert on that,

particularly the part where everybody ends up dead

cause that part is kind of important.

But like, what does it do to dating

when like some fraction of men and some fraction of women

decide that they’d rather date the video

of the thing that has been,

that is like relentlessly kind and generous to them

and is like, and claims to be conscious,

but like who knows what goes on inside it

and it’s probably not real,

but you know, you can think it’s real.

What happens to society?

I don’t know, I’m not actually an expert on that.

And the experts don’t know either

cause it’s kind of hard to predict the future.

Yeah, so, but it’s worth trying.

It’s worth trying.


So you have talked a lot about sort of

the longer term future where it’s all headed.

I think-

By longer term, we mean like not all that long,

but yeah, where it all ends up.

But beyond the effects of men and women dating AI systems,

you’re looking beyond that.

Yes, cause that’s not how

the fate of the galaxy got settled.


Let me ask you about your own personal psychology.

A tricky question.

You’ve been known at times to have a bit of an ego.

Do you think-

Says who, but go on.

Do you think ego is empowering or limiting

for the task of understanding the world deeply?

I reject the framing.

So you disagree with having an ego.

No, I think that the question of like

what leads to making better or worse predictions,

what leads to being able to pick out

better or worse strategies is not carved at its joints

by talking of ego.

So it should not be subjective.

It should not be connected to the intricacies of your mind.

No, I’m saying that like,

if you go about asking all day long,

like, do I have enough ego?

Do I have too much of an ego?

I think you get worse at making good predictions.

I think that to make good predictions,

you’re like, how did I think about this?

Did that work?

Should I do that again?

You don’t think we as humans get invested in an idea

and then others attack you personally for that idea

so you plant your feet and it starts to be difficult

to when a bunch of assholes, low effort,

attack your idea to eventually say,

you know what, I actually was wrong.

And tell them that.

It’s as a human being, it becomes difficult.

It is, you know, it’s difficult.

So like Robin Hanson and I debated AI systems

and I think that the person who won that debate was Gwern.

And I think that reality was like to the Yudkowsky,

like well to the Yudkowsky side

of the Yudkowsky-Hanson spectrum,

like further than Yudkowsky.

And I think that’s because I was like

trying to sound reasonable compared to Hanson

and like saying things that were defensible

and like relative to Hanson’s arguments

and reality was like way over here.

In particular in respect to,

so like Hanson was like all the systems

will be specialized.

Hanson may disagree with this characterization.

Hanson was like all the systems will be specialized.

I was like, I think we build like specialized

underlying systems that when you combine them

are good at a wide range of things

and the reality is like, no, you just like stack

more layers into a bunch of gradient descent.

And I feel looking back that like by trying

to have this reasonable position contrasted

to Hanson’s position, I missed the ways

that reality could be like more extreme

than my position in the same direction.

So is this like a failure to have enough ego?

Is this a failure to like make myself be independent?

Like I would say that this is something

like a failure to consider positions

that would sound even wackier and more extreme

when people are already calling you extreme.

But I wouldn’t call that not having enough ego.

I would call that like insufficient ability

to just like clear that all out of your mind.

In the context of like debate and discourse

which is already super tricky.

In the context of prediction,

in the context of modeling reality.

If you’re thinking of it as a debate,

you’re already screwing up.

So is there some kind of wisdom and insight

you can give to how to clear your mind

and think clearly about the world?

Man, this is an example of like where I wanted

to be able to put people into fMRI machines

and you’d be like, okay, see that thing you just did?

You were rationalizing right there.

Oh, that area of the brain lit up.

Like you are like now being socially influenced

is kind of the dream.

And I don’t know, like I wanna say like just introspect

but for many people introspection is not that easy.

Like notice the internal sensation.

Can you catch yourself in the very moment

of feeling a sense of, well, if I think this thing,

people will look funny at me.

Okay, like now that if you can see that sensation

which is step one, can you now refuse to let it move you?

Or maybe just make it go away.

And I feel like I’m saying like, I don’t know,

like somebody is like, how do you draw an owl?

And I’m saying like, well, just draw an owl.

So I feel like maybe I’m not really,

that I feel like most people like the advice they need

is like, well, how do I notice

the internal subjective sensation in the moment

that it happens of fearing to be socially influenced?

Or okay, I see it, how do I turn it off?

How do I let it not influence me?

Like, do I just like do the opposite

of what I’m afraid people criticize me for?

And I’m like, no, no, you’re not trying to do the opposite

of what people will, of what you’re afraid you’ll be like,

of what you might be pushed into.

You’re trying to like let the thought process complete

without that internal push.

Like can you, like not reverse the push,

but like be unmoved by the push.

Are these instructions even remotely helping anyone?

I don’t know.

I think when those instructions,

even those words you’ve spoken,

and maybe you can add more, when practiced daily,

meaning in your daily communication.

So it’s daily practice of thinking without influence.

I would say find prediction markets that matter to you

and bet in the prediction markets.

That way you find out if you are right or not.

And you really, there’s stakes.

Or even Manifold Markets where the stakes are a bit lower.

But the important thing is to like get the record.

And, you know, I didn’t build up skills here

by prediction markets.

I built them up via like,

well, how did the FOOM debate resolve?

And my own take on it as to how it resolved.

And yeah, like the more you are able to notice yourself

not being dramatically wrong,

but like having been a little off.

Your reasoning was a little off.

You didn’t get that quite right.

Each of those is an opportunity to make like a small update.

So the more you can like say oops softly, routinely,

not as a big deal,

the more chances you get to be like,

I see where that reasoning went astray.

I see how I should have reasoned differently.

And this is how you build up skill over time.

What advice could you give to young people

in high school and college,

given the highest of stakes things

you’ve been thinking about?

If somebody’s listening to this and they’re young

and trying to figure out what to do with their career,

what to do with their life,

what advice would you give them?

Don’t expect it to be a long life.

Don’t put your happiness into the future.

The future is probably not that long at this point.

But none know the hour nor the day.

But is there something,

if they want to have hope to fight for a longer future,

is there something, is there a fight worth fighting?

I intend to go down fighting.

I don’t know.

I admit that although I do try to think painful thoughts,

what to say to the children at this point

is a pretty painful thought as thoughts go.

They want to fight.

I hardly know how to fight myself at this point.

I’m trying to be ready for being wrong about something,

preparing for my being wrong

in a way that creates a bit of hope

and being ready to react to that

and going looking for it.

And that is hard and complicated.

And somebody in high school,

I don’t know, you have presented a picture of the future

that is not quite how I expect it to go,

where there is public outcry.

And that outcry is put into a remotely useful direction,

which I think at this point

is just like shutting down the GPU clusters.

Because no, we are not in a shape to frantically do,

at the last minute, do decades worth of work.

The thing you would do at this point

if there were massive public outcry

pointed in the right direction,

which I do not expect,

is shut down the GPU clusters

and crash program on augmenting

human intelligence biologically.

Not the AI stuff, biologically.

Because if you make humans much smarter,

they can actually be smart and nice.

Like you get that in a plausible way,

in a way that you do not get it.

And it is not as easy to do

with synthesizing these things from scratch,

predicting the next tokens and applying RLHF.

Like humans start out in the frame

that produces niceness,

that has ever produced niceness.

And in saying this,

I do not want to sound like the moral of this whole thing

was like, oh, you need to engage in mass action

and then everything will be all right.

This is because there’s so many things

where somebody tells you that the world is ending

and you need to recycle.

And if everybody does their part

and recycles their cardboard,

then we can all live happily ever after.

And this is not,

this is unfortunately not what I have to say.

Everybody recycling their cardboard,

it’s not gonna fix this.

Everybody recycles their cardboard

and then everybody ends up dead.

Metaphorically speaking.

But if there was enough,

on the margins,

you just end up dead a little later

on most of the things you can do that are,

that a few people can do by trying hard.

But if there was enough public outcry

to shut down the GPU clusters,

then you could be part of that outcry.

If Eliezer is wrong in the direction

that Lex Fridman predicts,

that there was enough public outcry

pointed enough in the right direction

to do something that actually,

actually, actually results in people living.

Not just like we did something,

not just there was an outcry

and the outcry was like given form

and something that was like safe and convenient

and like didn’t really inconvenience anybody

and then everybody died everywhere.

There was enough actual like,

oh, we’re going to die.

We should not do that.

We should do something else,

which is not that,

even if it is like not super duper convenient

and wasn’t inside the previous political Overton window.

If there is that kind of public,

if I’m wrong and there is that kind of public outcry,

then somebody in high school

could be ready to be part of that.

If I’m wrong in other ways,

then you could be ready to be part of that.

But like,

and if you’re like a brilliant young physicist,

then you could like go into interpretability.

And if you’re smarter than that,

you could like work on alignment problems

where it’s harder to tell if you got them right or not.

And other things,

but mostly for the kids in high school,

it’s like, yeah, if it,

if it, you know,

you’d like be ready to help

if Eliezer Yudkowsky is wrong about something

and otherwise don’t put your happiness into the far future.

It probably doesn’t exist.

But it’s beautiful that you’re looking

for ways that you’re wrong.

And it’s also beautiful that you’re open to being surprised

by that same young physicist with some breakthrough.

It feels like a very, very basic competence

that you are praising me for.

And you know, like, okay, cool.

I don’t think it’s good that we’re in a world

where that is something that I deserve

to be complimented on,

but I’ve never had much luck

in accepting compliments gracefully.

Maybe I should just accept that one gracefully,

but sure.

Thank you very much.

You’ve painted with some probability a dark future.

Are you yourself, just when you,

when you think,

when you ponder your life

and you ponder your mortality,

are you afraid of death?

Think so, yeah.

Does it make any sense to you that we die?

Like what?

There’s a power to the finiteness of the human life

that’s part of this whole machinery of evolution.

And that finiteness doesn’t seem to be

obviously integrated into AI systems.

So it feels like almost some fundamentally in that aspect,

some fundamentally different thing that we’re creating.

I grew up reading books like

Great Mambo Chicken and the Transhuman Condition,

and later on Engines of Creation and Mind Children,

you know, like age 12 or thereabouts.

So I never thought I was supposed to die after 80 years.

I never thought that humanity was supposed to die.

I thought we were like,

I always grew up with the ideal in mind

that we were all going to live happily ever after

in the glorious transhumanist future.

I did not grow up thinking that death

was part of the meaning of life.

And now-

And now I still think it’s a pretty stupid idea.

But there is-

You do not need life to be finite to be meaningful.

It just has to be life.

What role does love play in the human condition?

We haven’t brought up love in this whole picture.

We talked about intelligence,

we talked about consciousness.

It seems part of humanity.

I would say one of the most important parts

is this feeling we have towards each other.

If in the future there were routinely

more than one AI, let’s say two,

for the sake of discussion,

who would look at each other and say,

I am I, and you are you.

The other one also says, I am I, and you are you.

And sometimes they were happy and sometimes they were sad.

And it mattered to the other one

that this thing that is different from them

is like they would rather it be happy than sad

and entangled their lives together.

Then this is a more optimistic thing

than I expect to actually happen.

And a little fragment of meaning would be there,

possibly more than a little,

but that I expect this to not happen,

that I do not think this is what happens by default,

that I do not think that this is the future

we are on track to get,

is why I would go down fighting

rather than just saying, oh well.

Do you think that is part of the meaning

of this whole thing, of the meaning of life?

What do you think is the meaning of life, of human life?

It’s all the things that I value about it

and maybe all the things that I would value

if I understood it better.

There’s not some meaning far outside of us

that we have to wonder about.

There’s just like looking at life and being like,

yes, this is what I want.

The meaning of life is not some kind of,

like meaning is something that we bring to things

when we look at them.

We look at them and we say like, this is its meaning to me.

It’s not that before humanity was ever here,

there was like some meaning written upon the stars

where you could like go out to the star

where that meaning was written and like change it around

and thereby completely change the meaning of life, right?

Like the notion that this is written

on a stone tablet somewhere implies

you could like change the tablet and get a different meaning

and that seems kind of wacky, doesn’t it?

So it doesn’t feel that mysterious to me at this point.

It’s just a matter of being like, yeah, I care.

I care.

And part of that is the love that connects all of us.

It’s one of the things that I care about.

And the flourishing of the collective intelligence

of the human species.

You know, that sounds kind of too fancy to me.

I’d just look at all the people, you know,

like one by one up to the eight billion

and be like, that’s life, that’s life, that’s life.

And Eliezer, you’re an incredible human.

It’s a huge honor.

I was trying to talk to you for a long time

because I’m a big fan.

I think you’re a really important voice

and really important mind.

Thank you for the fight you’re fighting.

Thank you for being fearless and bold

and for everything you do.

I hope we get a chance to talk again

and I hope you never give up.

Thank you for talking today.

You’re welcome.

I do worry that we didn’t really address

a whole lot of fundamental questions I expect people have.

But, you know, maybe we got a little bit further

and made a tiny little bit of progress

and I’d say like be satisfied with that.

But actually, no, I think one should only be satisfied

with solving the entire problem.

To be continued.

Thanks for listening to this conversation

with Eliezer Yudkowsky.

To support this podcast,

please check out our sponsors in the description.

And now, let me leave you with some words from Elon Musk.

With artificial intelligence, we are summoning the demon.

Thank you for listening and hope to see you next time.

