OpenAI - GPT-4 Developer Livestream

To the GPT-4 developer demo live stream.

Honestly, it’s kind of hard for me

to believe that this day is here.

OpenAI has been building this technology really

since we started the company.

But for the past two years, we’ve

been really focused on delivering GPT-4.

That started with rebuilding our entire training stack,

actually training the model, and then

seeing what it was capable of, trying to figure out

its capabilities, its risks, working with partners

in order to test it in real-world scenarios,

really tuning its behavior, optimizing the model,

getting it available so that you can use it.

And so today, our goal is to show you

a little bit of how to make GPT-4 shine,

how to really get the most out of it, where its weaknesses are,

where we’re still working on it, and just how to really use it

as a good tool, a good partner.

So if you’re interested in participating in the stream,

if you go to our Discord, so that’s discord.gg slash

openai, there’s comments in there,

and we’ll take a couple of audience suggestions.

So the first thing I want to show you

is the first task that GPT-4 could do that we never really

got 3.5 to do.

And the way to think about this is all throughout training

that you’re constantly doing all this work.

It’s 2 AM.

The pager goes off.

You fix the model.

And you’re always wondering, is it going to work?

Is all of this effort actually going to pan out?

And so we all had a pet task that we really liked

and that we would all individually be trying to see,

is the model capable of it now?

And I’m going to show you the first one

that we had a success for 4, but never really got there for 3.5.

So I’m just going to copy the top of our blog post from today.

I’m going to paste it into our playground.

Now, this is our new chat completions playground

that came out two weeks ago.

I’m going to show you first with GPT-3.5.

4 has the same API to it, the same playground.

The way that it works is you have a system message where

you explain to the model what it’s supposed to do.

And we’ve made these models very steerable.

So you can provide it with really any instruction you

want, whatever you dream up.

And the model will adhere to it pretty well.

And in the future, it will get increasingly, increasingly

powerful at steering the model very reliably.

You can then paste whatever you want as a user.

The model will return messages as an assistant.

And the way to think of it is that we’re moving away

from just raw text in, raw text out,

where you can’t tell where different parts

of the conversation come from, but towards this much more

structured format that gives the model the opportunity to know,

well, this is the user asking me to do something

that the developer didn’t intend.

I should listen to the developer here.

All right, so now, time to actually show you

the task that I’m referring to.

So everyone’s familiar with?

Summarize.

This is an article into a sentence,

getting a little more specific, but where

every word begins with G. So this is 3.5.

Let’s see what it does.

Yeah, it kind of didn’t even try.

Just gave up on the task.

This is pretty typical for 3.5 trying

to do this particular kind of task.

If it’s sort of a very kind of stilted article or something

like that, maybe it can succeed.

But for the most part, 3.5 just gives up.

But let’s try the exact same prompt, the exact same system

message in GPT-4.

So kind of borderline whether you want to count AI or not,

but so let’s say AI doesn’t count.

That’s cheating.

So fair enough.

The model happily accepts my feedback.

So now to make sure it’s not just good for Gs,

I’d like to turn this over to the audience.

I’ll take a suggestion on what letter to try next.

In the meanwhile, while I’m waiting for our moderators

to pick the lucky, lucky letter, I will give a try with A.

But in this case, I’ll say GPT-4 is fine.

Why not?

Also, pretty good summary.

So I’ll hop over to our Discord.

All right.

Wow.

People are being a little ambitious here.

I’m really trying to put the model through the paces.

We’re going to try Q, which if you think about this

for a moment, I want the audience

to really think about how would you

do a summary of this article that all starts with Q.

It’s not easy.

It’s pretty good.

That’s pretty good.

All right.

So I’ve shown you summarizing an existing article.

I want to show you how you can flexibly combine ideas

between different articles.

So I’m going to take this article that

was on Hacker News yesterday, copy paste it

into the same conversation so it has all the context of what

we were just doing.

I’m going to say, find one common theme

between this article and the GPT-4 blog.

So this is an article about Pinecone,

which is a Python web app development framework.

And it’s making the technology more accessible, user friendly.

If you don’t think that was insightful enough,

you can always give some feedback

and say, that was not insightful enough.

Please, no, I’ll just even just leave it there.

Leave it up to the model to decide.

So bridging the gap between powerful technology

and practical applications.

Seems not bad.

And of course, you can ask for any other kind of task

you want using its flexible language understanding

and synthesis.

You can ask for something like, now turn the GPT-4 blog post

into a rhyming poem.

Picked up on opening up evals, open source for all,

helping to guide answering the call.

Which by the way, if you’d like to contribute to this model,

please give us evals.

We have an open source evaluation framework

that will help us guide and all of our users

understand what the model is capable of

and to take it to the next level.

So there we go.

This is consuming existing content

using GPT-4 with a little bit of creativity on top.

But next, I want to show you how to build with GPT-4.

What it’s like to create with it as a partner.

And so the thing we’re going to do

is we’re going to actually build a Discord bot.

I’ll build it live and show you the process,

show you debugging, show you what

the model can do, where its limitations are,

and how to work with them in order

to sort of achieve new heights.

So the first thing I’ll do is tell the model

that this time, it’s supposed to be an AI programming

assistant.

Its job is to write things out in pseudocode first

and then actually write the code.

And this approach is very helpful

to let the model break down the problem into smaller pieces.

And then that way, you’re not kind

of asking it to just come up with a super hard solution

to a problem all in one go.

It also makes it very interpretable

because you can see exactly what the model was thinking,

and you can even provide corrections if you’d like.

So here is the prompt that we’re going to ask it.

This is the kind of thing that 3.5 would totally choke on

if you’ve tried anything like it.

But so we’re going to ask for a Discord bot that

uses the GPT-4 API to read images and text.

Now, there’s one problem here, which

is this model’s training cutoff is in 2021,

which means it has not seen our new chat completions format.

So I literally just went to the blog post from two weeks ago,

copy pasted from the blog post, including the response format.

It has not seen the new image extension to that.

And so I just kind of wrote that up in just very minimal detail

about how to include images.

And now, the model can actually leverage that documentation

that it did not have memorized, that it does not know.

And in general, these models are very good at using information

that it’s been trained on in new ways

and synthesizing new content.

And you can see that right here, that it actually

wrote an entirely new bot.

Now, let’s actually see if this bot

is going to work in practice.

So you should always look through the code

to get a sense of what it does.

Don’t run untrusted code from humans or from AIs.

And one thing to note is that the Discord API has

changed a lot over time, and particularly

that there’s one feature that has changed a lot

since this model was trained.

Give it a try.

In fact, yes, we are missing the intense keyword.

This is something that came out in 2020.

So the model does know it exists,

but it doesn’t know which version of the Discord API

we’re using.

So are we out of luck?

Well, not quite.

We can just simply paste to the model exactly the error

message, not even going to say, hey,

this is from running your code.

Could you please fix it?

We’ll just let it run.

And the model says, oh, yeah.

Whoops, the intense argument.

Here’s the correct code.

Now, let’s give this a try, once again kind of making sure

that we understand what the code is doing.

Now, a second issue that can come up

is it doesn’t know what environment I’m running in.

And if you notice, it says, hey, here’s

this inscrutable error message, which

if you’ve not used Jupyter Notebook a lot with async IO

before, you probably have no idea what this means.

But fortunately, once again, you can just sort of say

to the model, hey, I’m using Jupyter,

and I’m running a JavaScript code.

I’m using Jupyter and would like to make this work.

And you fix it.

And the specific problem is that there’s already

an event loop running.

So you need to use this nest async IO library.

You need to call nest async IO.apply.

The model knows all of this, correctly instantiates

all of these pieces into the bot.

It even helpfully tells you, oh, you’re running in Jupyter.

Well, you can do this bang pip install

in order to install the package if you don’t already have it.

That was very helpful.

So now we’ll run.

And it looks like something happened.

So the first thing I’ll do is go over to our Discord.

And I will paste in a screenshot of our Discord itself.

So remember, GPT-4 is not just a language model.

It’s also a vision model.

In fact, it can flexibly accept inputs

that intersperse images and text arbitrarily,

kind of like a document.

Now, the image feature is in preview.

So this is going to be a little sneak peek.

It’s not yet publicly available.

It’s something we’re working with one partner called BeMyEyes

in order to really start to develop it

and get it ready for primetime.

But you can ask anything you like.

For example, I can’t, you know, I’ll say,

GPT-4, hello world.

Can you describe this image in painstaking detail?

All right, which, first of all,

think of how you would do this yourself.

There’s a lot of different things you could latch onto,

a lot of different pieces of the system you could describe.

And we can go over to the actual code.

And we can see that, yep, we, in fact, received the message,

have formatted an appropriate request for our API.

And now we wait because, you know,

one of the things we have to do

is we have to make the system faster.

That’s one of the things that we’re working on optimizing.

In the meanwhile, I just want to say

to the audience that’s watching,

we’ll take an audience request next.

So if you have an image and a task you’d like to accomplish,

please submit that to the Discord.

Our moderators will pick one that will run.

So we can see that the Discord,

oh, it looks like we have a response.

Perfect.

So it’s a screenshot of a Discord application interface.

Pretty good. Did not even describe it.

It knows that it’s Discord.

There’s probably Discord written there somewhere

where it just kind of knows this from prior experience.

Server icon labeled GPT-4

describes the interface in great detail.

Talks about all the people telling me

that I’m supposed to do queue.

Very kind audience.

And describes a bunch of the notification messages

and the users that are in the channel.

And so there you go.

That’s some pretty good understanding.

Now, this next one, if you notice,

first of all, we got a post,

but the model did not actually see the message.

So is this a failure of the model

or of the system around the model?

Well, we can take a look.

And if you notice here, content is an empty string.

We received a blank message contents.

The reason for this is a dirty trick

that we played on the AI.

So if you go to the Discord documentation

and you scroll through it all the way down to,

I can see it hard for me to even find, honestly,

to the message content intent.

You’ll see this was added as of September 2022

as a required field.

So in order to receive a message

that does not explicitly tag you,

you now have to include this new intent in your code.

Remember I said, intents have changed a lot over time.

This is much newer than the model is possibly able to know.

So maybe we’re out of luck.

We have to debug this by hand.

But once again, we can try to use

GPT-4’s language understanding capabilities to solve this.

Now keep in mind, this is a document of like,

I think this is like 10,000, 15,000 words,

something like that.

It’s not formatted very well.

This is literally a command a copy paste.

Like this is what it’s supposed to parse through

to find in the middle of that document that,

oh yeah, message contents, that’s required now.

But let’s see if it can do it.

So we will ask for, I am receiving blank message contents.

Can you, why could this be happening?

How do I fix it?

So one thing that’s new about GPT-4 is context length.

32,000 tokens is kind of the upper limit

that we support right now.

And the model is able to flexibly use long documents.

It’s something we’re still optimizing.

So we recommend trying it out,

but not necessarily sort of really scaling it up just yet,

unless you have an application that really benefits from it.

So if you’re really interested in long context,

please let us know.

We want to see what kinds of applications it unlocks.

But if you see, it says, oh yeah,

message content intent was not enabled.

And so you can either ask the model

to write some code for you,

or you could actually just do it the old fashioned way.

Either way is fine.

I think that this is a augmenting tool,

makes you much more productive,

but it’s still important that you are in the driver’s seat

and are the manager and knows what’s going on.

So now we’re connected once again.

And Boris, would you like to rerun the message?

Okay.

Once again, we can see that we have received it,

even though the bot was not explicitly tagged.

Seems like a pretty good description.

Interesting.

This is an interesting image, actually.

It looks like it’s a dolly generated one.

And let’s actually try this one as well.

What’s funny about this image?

Oh, it’s already been submitted.

So once again, we can verify

that it’s making the right API calls.

Squirrels do typically eat nuts.

We don’t expect them to use a camera or act like a human.

So I think that’s a pretty good explanation

of why that image is funny.

So I’m going to show you one more example

of what you can do with this model.

So I have here a nice hand-drawn mock-up of a joke website.

Definitely worthy of being put up on my refrigerator.

So I’m just going to take out my phone,

literally take a photo of this mock-up,

and then I’m going to send it to our Discord.

All right, going to send it to our Discord.

And this is, of course, the rockiest part,

making sure that we actually send it to the right channel,

which in fact, I think maybe I did not.

Sent it to the wrong channel.

It’s funny, it’s always the non-AI parts of these demos

that are the hardest part to do.

And here we go.

Technology is now solved.

And now we wait.

So the thing that’s amazing in my mind

is that what’s going on here

is we’re talking to a neural network,

and this neural network was trained

to predict what comes next.

It played this game of being shown a partial document

and then predicted what comes next

across an unimaginably large amount of content.

And from there, it learns all of these skills

that you can apply in all of these very flexible ways.

And so we can actually take now this output.

So literally we just said to output the HTML

from that picture.

And here we go.

Actual working JavaScript filled in the jokes.

For comparison, this was the original

of our mock-up.

And so there you go, going from hand-drawn beautiful art,

if I do say so myself, to working website.

And this is all just potential, right?

You can see lots of different applications.

We ourselves are still figuring out new ways to use this.

So we’re going to work with our partner.

We’re going to scale up from there,

but please be patient because it’s going to take us some time

to really make this available for everyone.

So I have one last thing to show you.

I’ve shown you reading existing content.

I’ve shown you how to build with the system as a partner.

The last thing I’m going to show

is how to work with the system to accomplish a task

that none of us like to do, but we all have to.

So you may have guessed,

the thing we’re going to do is taxes.

Now note that GPT is not a certified tax professional,

nor am I, so you should always check with your tax advisor,

but it can be helpful to understand some dense content,

to just be able to empower yourself

to be able to sort of solve problems

and get a handle on what’s happening

when you could not otherwise.

So once again, I’ll do a system message.

In this case, I’m going to tell it that it’s tax GPT,

which is not a specific thing

that we’ve trained into this model.

You can be very creative if you want with the system message

to really get the model in the mood of what is your job?

What are you supposed to do?

So I pasted in the tax code.

This is about 16 pages worth of tax code.

And there’s this question about Alice and Bob.

They got married at one point and here are their incomes

and they take a standard deduction, they’re filing jointly.

So first question,

what is their standard deduction for 2018?

So while the model is chugging,

I’m going to solve this problem by hand

to show you what’s involved.

So the standard deduction is the basic standard deduction

plus the additional.

The basic one is 200% for joint return of subparagraph C,

which is here, okay?

So additional doesn’t apply.

The limitation doesn’t apply.

Okay, none of these apply.

Oh wait, special rules for taxable year 2018,

which is the one we care about through 2025,

you have to substitute 12,000 for 3,000.

So 200% of 12,000, 24,000 is the final answer.

If you notice the model got to the same conclusion

and you can actually read through its explanation.

And to tell you the truth,

the first time I tried to approach this problem myself,

I could not figure it out.

I spent half an hour reading through the tax code,

trying to figure out this like back reference

and why there’s subparagraph,

just what’s even going on.

It was only by asking the model to spell out its reasoning,

and then I followed along that I was like,

oh, I get it now, I understand how this works.

And so that I think is where the power of this system lies.

It’s not perfect, but neither are you.

And together it’s this amplifying tool

that lets you just reach new heights.

And you can go further.

You can say, okay, now calculate their total liability.

And here we go, it’s doing the calculation.

Honestly, every time it does it, it’s just, it’s amazing.

This model is so good at mental math.

It’s way, way better than I am at mental math.

It’s not hooked up to a calculator.

Like that’s another way that you could really try

to enhance these systems,

but it has these raw capabilities that are so flexible.

It doesn’t care if it’s code,

it doesn’t care if it’s language,

it doesn’t care if it’s tax.

All of these capabilities in one system

that can be applied towards the problem that you care about,

towards your application, towards whatever you build.

And so to end it, the final thing that I will show

is a little other dose of creativity,

which is now summarize this problem into a rhyming poem.

And there we go.

A beautiful, beautiful poem about doing your taxes.

So thank you everyone for tuning in.

I hope you learned something about what the model can do,

how to work with it.

And honestly, we’re just really excited

to see what you’re going to build.

I’ve talked about OpenAI evals, please contribute.

We think that this model, improving it,

bringing it to the next level,

is something that everyone can contribute to,

and that we think it can really benefit a lot of people,

and we want your help to do that.

So thank you very much.

We’re so excited to see what you’re going to build.

Thank you.