OpenAI Scholars Demo Day 2019 ｜ OpenAI

Video

Transcript

you

hello everybody we’re gonna go ahead and

get started

fantastic welcome to open AI scholars

demo day thank you all for being here

this evening we have eight presentations

tonight as a result of our Scholars

program we had 550 people apply and

eight scholars so that makes us a bit

more competitive than Harvard but you

know we don’t want it we don’t want to

break any who and the scholars have

spent the past three months studying

machine learning full-time two of those

months have been spent on a

self-selected curriculum going through

the different skills that they need to

be able to complete a final presentation

which they will be showing to you today

this is a project that they’ve completed

in one month which I’ve been told is a

very short amount of time to complete a

experiment in machine learning when you

are just a beginner however they’ve been

assisted by our really awesome mentors

who are a combination of opening eye and

external folks and we’re really excited

to have everyone here so I’m gonna go

ahead and pass it off to Ilya who’s

gonna say a few words before we get

started

Thanks hello everyone and welcome to the

scholars demo day it’s really exciting

to see the projects that the scholars

were able to accomplish in just one

month one thing about machine learning

is that it’s not the easiest field to

enter without mentorship and working

closely with a good mentor can really

narrow down the very big surface area of

machine learning down to manageable

pieces that make progress far more

rapidly and I can say for myself I

definitely wouldn’t be where I am

without the mentorship that I received

so with this I want to say thank you and

real gratitude to all the mentors who

helped bring the scholars to where they

are right now

and finally doing a project is just in

just one months is not an easy feat at

all so congratulations on the scholars

who’ve done this and let’s see let’s see

the projects

[Applause]

thank you for coming to our demo day

today I’m going to talk about exploring

gamma the discount of the future or the

weight of the past so here’s a bit of

background for those who are unfamiliar

with reinforcement learning it is the

framework where an agent takes action in

the environment to maximize cumulative

rewards so the measure we care about X

the expected sum of total rewards and

the measure we actually optimize is they

expected sum of total discounted rewards

so the difference of here is the

discount and not only it prevents us

from getting explosive sum to infinity

but also it injects some preferences

which I’m going to talk about so what is

discount factor from economics this

discount factor gamma specifies some

intertemporal temp preferences and what

I mean by preferences is to imagine a

thought experiment where if I hand you a

full approached moral or some gamma

fraction of an apple today and then I

twist the gamma until you find the

single gamma that you are indifferent

between the two then this gamma reviews

your preference so with this preference

in mind let’s look at a toy example in

some simple grid word experiment here we

have two agent one has a low discount

preference so it will have a what prefer

immediate rewards such as the coins on

the Left which has lower value or the

agent on below has a high discount

meaning it will prefer long-term future

reward which is attainment so we can see

different preferences results in

distinctive behavior that agent could

express as a beginner 2rl algorithm we

may just ask which discount factor gamma

to use well some will tell you 0.99 or

otherwise say something close to one no

no not one for boundedness and some ways

say just try a set of commas and pick

the best so none of these answers seemed

entirely satisfactory so

I found out this about Blackwell

optimality principle actually says that

in all environments there exists an

optimal policy that simultaneously

optimal with gamma higher than some

threshold so looking at this principle

we may find it quite intuitive because

given some real world environments where

we always a maximize cumulative reward

that applies gamma equals 1 however in

real what we can now use gamma equal to

1 but having a gamma of 0.99 results in

a very similar intertemporal preference

as a gamma equal to 1 having a gamma

0.99 in the diamond case in the in the

grillwork case the agent will still pick

diamond over the coin despite having a

slightly lower discount factor so the

question I’m interested in is to tip our

algorithms always find the Blackwell

optimal policy for gammas above the

threshold well let’s go back to this

time we’ll use a lower exploration

instead of Hayek’s persian in the

previous case and we see the behavior is

the same for the low discount agent

however for the highest own agent it

cannot obtain any reward so not

necessarily in this case that Alton

power algorithm could find this kind of

optimal policy so in this work we will

demonstrate this issue and propose

methods to repair it to do that I set up

to world war environment one with sparse

reward the other with dense reward so

the agent in red is trying to collect

all the yellow coins which will give

them positive reward and they want to

avoid poison we shall give them negative

reward and lastly they they don’t want

to get trapped otherwise they will die

so the algorithm I use is from baselines

overnight spaced lines D queuing and

then the experiment setup is I only vary

the discount factor and keeping

everything else constant and here’s the

initial results on the sparse

environment so I pick a set of gammas

from 0.1 0.2 0.5 0.8 and 0.99 as you can

see on

right curve the highest gamma 0.99

yields the best performance and this

seems like there might be a stretch

between gamma 0.5 and point 9 0.8 such

that above that all gammas or high

gammas yield optimal performance so this

is quite consistent with the back

blackwell optimality principle let’s

look like how let’s look at how this

will work in a dense environment so here

we see that the highest gamma which is

in pink 0.99 is actually not doing well

it’s not even doing as well as the gamma

0.5 in this case so we see some

inconsistency with the Blackwell

optimality principle let’s dig further

so after thinking about it my hypothesis

to this behavior is that discount factor

may play a dual role in DQ and update

and specifically it not only explicitly

specifies that intertemporal preferences

which is to discount the future but

implicitly it includes some confidence

on bootstrapping of the function

approximator

which is to weighing the past so I

proposed a time variant discounting

gamma of T so that when we specify a

male pick five a fraction that’s a some

fraction of the total time steps and

during that period we are going to vary

our gamma from 0.1 to a final gamma that

we specified so that they have a linear

schedule then we wait earlier

experiences less the gamma will keep

fixed after the myopic fraction fraction

so using this simple scheme on the dense

environment

t queuing algorithm let’s compare the

experiments so the pink curve is still

the fixed gamma from the in the dense

environment and although other colorful

curves are always different myopic

fraction and to see this more clearly I

grouped all myopic fraction setups in a

single block in blue and you can see

other myopic fraction no matter how

large it is outperforms the basslines

algorithm with fixed gamma and because

0.99 what wasn’t doing well and then

with this initial myopia it could

actually become optimal so this is a

really good result and let’s see if it

can work in the path gamma which is 0.8

in the tense environment so I do the

same thing where the pink curve is the

fixed gamma 0.8 which was the best gamma

in the dense environment and all the

myopic fraction are in blue on the right

side and then we see that any male fake

fraction could eventually achieve the

same level of performance as the

original fixed gamma and just take

longer so what about the sparse

environment so I tried the same thing

where on the yeah so we have the fixed

gamma on the left we have gamma of 0.8

on the right we have gamma point 99 and

you can see subsequently as we increase

the myopic fraction the longer it takes

to reach optimal but they reach optimal

nonetheless so fixed gamma yields the

best performance in this sparse

environment but all my opaque fraction

converts to optimal eventually so it

doesn’t really hurt in the long run so

summing up from all this initial amalia

results we see that high discount gamma

could become optimal with initial myopia

and the benefits of this simple scheme

is that we don’t need to find humour

because it improves learning intense

reward environments and doesn’t harm

learning in sparse reward environment

however you may say that all our

hypothesis is that this initial myopia

will mitigate bias thus better

performance there might be a competing

hypothesis is by introducing this

initial myopia you may have more

exploration which might lead to better

performance so I’m trying to see whether

this benefits is result of bias

reduction or exploring

to do that I set up three experiment

where are fixed the discount factor and

for the baseline setup I have zero

myopia and load low exploration for the

myopia setup I have low myopia and low

exploration and for the exploration

setup I have zero myopia and high

exploration so to see the results let’s

look at the plot on the left here our

gamma is fixed to be zero point eight

across our three setups and for the

baseline it is in pink and then we see

that it has similar performance as the

myopia which is in orange and

exploration actually did worse than both

of them as it because it takes longer

and on the right we have our gamma of

0.99 which is considered the poor gamma

in the dense environment and we can see

with myopia actually outperform post

baseline and exploration and with the

exploration although is eventually

surpasses baseline a bit but still takes

a long time so we can sort of conclude

that exploration helps but not

significantly and it takes longer

training time but with myopia it

improves significantly and converges

faster so as part of the future

directions we try to formalize this dual

role intuition and run more experiments

on standard testing ground and I will

also want to compare a myopia schedule

lambda versus gamma in PPO I have some

initial results and happy to discuss

offline after the presentation so just a

recap here some related works and final

takeaway discount factor matters in deep

reinforcement learning and it has a dual

role that specifies intertemporal

preferences and also includes confidence

on bootstrapping from function

approximation and a simple myopic

schedule is a robust and effective way

to improve performance and the same

logic may work beyond EQ

and discrete action state framework

thank you I’m happy to take questions

[Applause]

yeah so in the baselines dqn they have

the similar linear schedule for

exploration where it started low I

started high and become low and stay low

for the whole time yeah

hello I was wondering if you can think

of a way so I really liked this idea of

seeing Gemma as something that kind of

enclose your confidence do you think

there’s a way to kind of like take you

uncertainty into account more directly

it seems like you’re currently doing it

over time but do you think there’s a

there’s a way to like make that more

explicit by I don’t know getting some

uncertainty estimate for your cue

function thank you it’s a really good

question I’m thinking to first try it on

the generalized advantages estimate

because it exhibits basically separates

out lamda and n gamma in the equation

where I feel like lamda is precisely the

role it plays in our confidence in

bootstrapping so I guess by varying

lambda perhaps in the algorithm PPO

could help with this issue yeah

any other questions

[Applause]

okay so I’d like to begin by thanking

everyone at open AI who organized this

event you can’t hear me how about now

okay and I’d also like to thank all of

you for attending so previously I was a

PhD student at the University of Chicago

where I studied cell biology but these

days wood excites me most is the

prospect of getting general-purpose

robots deployed in the real world and

then getting them to do useful things

and so one challenge associated with

this is that robots will have to learn

how to solve new tasks with little to no

external feedback so for the past month

I’ve been working on a project that

leverages a robot’s internal motivation

and over in order to overcome this lack

of supervision so I’ll start by placing

intrinsic motivation in the context of

reinforcement learning so in the

reinforcement learning setting you have

an agent that interacts with an

environment via its policy so the policy

takes in observations and outputs

actions so at every time step the agent

takes an action and receives a reward

from the environment so by learning to

maximize its total expected reward the

agent can find a good policy and

eventually learn how to solve a given

task but finding a good policy can

actually be very difficult when the

rewards from the environment are sparse

so what we really need our reward

functions that are intrinsic to the

agent instead of relying solely on

sparse extrinsic rewards okay so how do

you get how do you get dense intrinsic

rewards so there’s actually been a lot

of work in this area but most approach

that are a lot of approaches Center on

some notion of novelty so by seeking

novelty the agent is driven to explore

and in doing so learns new skills that

might help it solve a given task

and so this approach is actually had a

lot had a lot of success recently with

solving challenging Atari games like

Montezuma’s Revenge

however these ideas haven’t really been

applied to robotics so in this project

I’ll show you how using a simple

formulation for intrinsic rewards will

lead to nice solutions for solving

challenging robotics problems so before

I discuss the method that I use I’ll

talk about the fet robotics environments

which were developed in-house at open AI

by matthias sitting in the back and so

the observations include the position

and velocity of the gripper and it also

includes the pose the linear velocity

and angular velocity of any object that

might be in the same the action space is

continuous and it’s four dimensional and

it includes the first three dimensions

correspond to change in position along x

y&z and the fourth dimension corresponds

to opening and closing the gripper and

so again we care about the sparse reward

setting so here the agent receives at

every time step award of 0 if it solves

a task and negative one otherwise and I

just want to point out that for all of

the environments that I’m going to talk

about which include reaching pushing

sliding and picking place the agent has

at most 50 time steps to actually solve

the task ok so here’s an outline of the

method that I use so the diagram on the

left you can see that it’s very similar

to the reinforcement learning setting

that I talked about a few slides ago but

here in addition to the agent having a

policy it also has a dynamics model so

the dynamics model takes as input the

agent’s current state and its action and

it makes a prediction for the next state

and I also want to note that we we could

set up the dynamics model to make

predictions for the robots change in

state

and so here’s the intrinsic reward his

intrinsic reward that I’m using and what

you’ll notice is that it’s simply the

prediction error of the dynamics model

and so larger prediction errors should

lead to larger intrinsic rewards and so

the idea is that the the idea is that

the agent shouldn’t get stuck in regions

that it’s already explored and instead

will be encouraged to explore elsewhere

okay so for training I use simple fully

connected networks for both the policy

and the dynamics model I train the

policy using PPO which is an actor

critic architecture that was developed

here by John Shulman and briefly since

PPO is on policy I use a number of

actors in parallel to collect a large

and diverse set of data that can

actually be used to update the

parameters of both the policy and the

dynamics model okay so here’s my first

set of results so I’m applying this

method to first to the simplest of the

tasks which is reaching and here I’m

comparing a baseline PPO implementation

with no intrinsic rewards to intrinsic

rewards that were that were generated by

predicting the full neck state or by

predicting the change in state and so

for a simple task like reaching it turns

out that you don’t actually need the

intrinsic rewards but adding them you

you can see that the agents learns to

solve the task much more quickly and you

can see in the bottom panel that as the

agent learns the intrinsic reward

actually goes down okay so I next looked

at the pushing task which is actually

quite a bit more complicated than the

reaching task so unlike reaching the

baseline policy is not actually able to

solve this task but when you add in the

dense intrinsic rewards you can see that

the agent quickly learn

how to solve solve the task and near the

end of training its solving nearly 100

percent of the episodes and so when I

approach the pushing task I actually

looked at a number of different hyper

parameters and so I’ll talk about three

here briefly so I looked at the size of

the individual layers within the network

and as you might expect larger networks

tend to perform better than smaller

networks I also looked at the learning

rate for the dynamics model and so this

actually turned out to be very important

so as you can see here smaller learning

rates tend to outperform larger learning

rates and the last thing I looked at was

saying whether or not resetting the

environment early after the agent has

already solved the task will lead to

better performance so that’s where you

can see here so after the agent has

solved the task by doing an early reset

that leads to a significant boost in

performance ok so I next looked at the

pick-and-place task and this s is

actually step up in difficulty from the

pushing task but here the story is more

or less the same the baseline policy

cannot solve the task but adding in

either intrinsic reward can actually

lead to the agent learning how to solve

the task and for reasons I don’t quite

understand yet predicting the full next

state tends to outperform predicting the

change in state so that’s something I

want to look at a little bit more in the

future ok so the last thing I looked at

was a sliding task it turns out that by

comparison the sliding task is probably

more difficult than the others and so

here I actually needed a larger Network

than what was required for the others

for the other tasks and I also needed

twice as many environment interactions

for the agent to learn how to solve this

okay so to summarize consistent with

previous works my results demonstrate

that adding intrinsic rewards can

actually be useful in solving

challenging tasks with sparse rewards

although unnecessary

for solving the reaching task adding

intrinsic rewards as an exploration

bonus actually leads to improve

performance and intrinsic rewards were

actually necessary for solving a more

difficult task so in the future I plan

to look at slightly more complicated

tasks such as tool use and block

stacking and I also plan to look at

different kinds of inputs both both for

the policy and for the dynamics model

and in particular I’m really interested

in looking at combining different sensor

sensor modalities so looking at

combining images with depth maps with

contact information okay so with that

I’d like to thank Maddie for organizing

the Scholars Program I’d like to show my

appreciation to scholars past and

present I’d like to give a special thank

you to Matthias Alex and Lillian they

gave me really good feedback that I was

able to use to make modifications to

fetch environments I’d like to thank

Yura Harry Igor and Wojcik with all of

whom I’ve had really enlightening

conversations and I’d like to give a

special thank you to Yura and Harry

whose previous work was actually the

inspiration for this project and last

but not least I’d like to thank thank my

mentor for y’all who provided me with a

lot of encouragement and support over

the last couple of months so with that

I’d be happy to take any questions

[Applause]

have you tried Alice tiems is that

awfully that’s a good question probably

when I start to incorporate different

sensory modalities I’ll have to move to

using Elysee Ames also may help with the

Delta version of the inputs yeah I could

so I wanted to try it originally but the

the simple method that I used already

had such good performance that I’d sort

of put that off any other questions

was was there any stochastic

stochasticity in the environment yes so

every time the environment was reset the

target location would change and if

there were a block in the scene the

starting location of the block would

also change oh I see was there any

stochasticity in the transitions like

from one point in time to the next point

so like this kind of movement there was

no stochasticity the movements were

completely deterministic I see

how important was it to trade off the

goal-directed reward versus the

intrinsic reward like did that require

some tuning or not so much

I think most of that tuning came from

tuning the learning rate for the

dynamics model because at the if the

Dyna if the learning rate for the

dynamics model was too big then the

agent would very quickly learn how the

environment worked and so the intrinsic

rewards I would get would be really

small whereas if you delayed that it

would get larger rewards early

encouraging exploration but if you’re if

you’re referring to like the size of the

rewards the relative to rewards say

negative one versus with the intrinsic

reward was I did tune that quite a bit

turns out one works really nicely which

offsets the reward of negative one from

the environment yeah what larger rewards

don’t work and smaller intrinsic rewards

don’t work any other questions great

thank you

[Applause]

you

hi everyone thank you for being here

tonight and louder okay is it any better

okay thank you so um welcome so I will

be talking about my experiment with fine

tuning GPT two small model for question

answering tonight before I start to talk

about giving you two details about my

experiment I would like to mention the

question that I was interested in which

guided me in my decision to work on this

problem one of the biggest challenges

that we have in natural language

understanding today is the ability to

create systems that have common sense

neat reasoning which is the ability of

an intelligent system to come up with

common-sense knowledge and reason about

a given text so this is still a

notoriously difficult task and although

we have very high-performing language

models and systems today we are still

struggling to perform better in this one

so in this chart you will see you notice

that there’s an increase in the

benchmark models and data sets that have

come out in the recent years that are

targeted exclusively for common sense

reasoning especially in 2018 alone we

have a number of data sets that were

specifically designed for common sense

reasoning and you will realize that most

of them are targeting the task of

question answering so reasoning is

necessary in in performing better better

especially in most of the NLU tasks

that’s because most of the time we will

not have the solution given in a

linguistic context and there will be a

lot of ambiguities in the language that

we need to our systems are intelligent

systems needs to be able to figure out

so this is this is one of the reasons

why I

I chose the task of question answering

for my fine tuning first of all QA is

one of the most important natural

language understanding tasks that will

allow us to measure how a system is

doing in terms of common-sense reasoning

in addition to that QA requires a mix of

language processing and reasoning skills

within a single task and that’s why it’s

more practical than probably dealing

with some other more complicated than

defined tasks in addition to that better

reasoning achieved to create systems

could be applied to a variety of systems

that is not limited to natural language

processing or understanding such as

vision and robotics the examples of

which we have started to see lately so

the approach that I take in my project

was to analyze the patterns that a

fine-tuned GPT small performance on QA

tasks could reveal about how a language

mount model attains and performs

reasoning so for this I have

experimented with this small model and

finding that on the Stanford

question-answering dataset for those of

you who are not familiar with the data

set I will I will be giving some

informations about that oh sorry I meant

to go back the statue of the models

amount of architectures that I have

worked on the first one you see on the

on the left is a linear classifier that

I put on top of GPT to a small model the

first public release one and the one on

the on the right will be a bio SCM in a

naive attempt to circumvent and

bi-directional uni directionality of GPT

to most of my results will be based on

the linear model because after going

through a lot of hyper parameter tuning

and

I noticed that actually new model

perform better for those of you who are

not familiar with the squat data side

squat dataset contains over a hundred

thousand question-and-answer pairs and

which which have answers and it also has

the point tube model also has over fifty

thousand unanswerable questions with

some plausible answers added and if your

model comes up with those plausible

answer is still very informative about

how your model is doing in these two

examples that are taken from the data

sets you will see that the question with

the answer is a very factor it’s styled

simple question and the one on the other

side which has a plausible answer but

you cannot actually retrieve the answer

from the from the passage that is given

for that question so my model was these

are representative of some of the

questions that my model was able to

answer correctly or abstain from

answering because it did not have an

answer in the data set squall has in the

development Setsuko has over 11,000

questions and 5000 of which is

unanswerable and because I thought

focusing on the unanswerable questions

and the performance of the model on this

kind of question or informative about

how the reasoning and common-sense

working for the model most of my the

most of the numbers and reports that I

will be presenting today based on those

type of questions although there are

other data sets that are specifically

designed for common-sense reasoning I

use squaws because it will be it will

give me a good start to understand how

such a powerful language model although

I’m using the smallest version will

perform in common sense and what are the

mistakes what are the successes and

failures that the model will make and it

will guide me to work on more complex

sophisticated data sets going forward so

the linear model that I implemented

worked relatively better in the

unanswerable questions where it was able

to pick up that the answered the

question did not have an answer and

abstain from answering to that in the

plausible answer section the model was

able to pick first I’ve seen from out a

spring like okay this question does not

have an answer but also able to come up

with the wasa it got wrong was able to

come up with the plausible answers which

was not extremely unreasonable in this

example this is a repeat this is

representative of the strategy that the

model learned from the training the

model takes attends to the first few

tokens in the question text when it’s

trying to answer a question this

repeated so many times that I was like

okay this is this is definitely taking

some of the attention to the first few

question few tokens in the question text

and then trying to pull out the answer

from the from the passage as you can see

in this example the model predicted that

the answer contained the little phrase

that I correctly at the bottom but the

cone but the real answer the correct

answer was the the word concrete if if

we go back to the original paper of GPT

to I notice that some of the heuristics

mentioned in the paper such as how the

unsupervised 0 shall in zero shot

setting the create ask was taking up on

the questions like who what where and

when the same pattern kind of appeared

in the model that I also trained in here

some of my observations in include that

the model performs comparatively better

on questions that are unanswerable and

partial matches mainly consist of

initial tokens from the question and

when an answer is expressed in words

that are different from the question in

a way you can think about it as a

paraphrasing the model often fails to

recognize it if the order of the words

are different or used synonyms or

antonyms for expressing the same idea

the model was not performing very well

so for future directions I would like to

experiment more with the bigger model

which has been very recently released

and work more on the common-sense

reasoning datasets that are specifically

designed for this task in addition to

that I would like to work more on

natural language understanding two

unsupervised learning because the main

idea behind GPT 2 was also to eliminate

the need to fine tune models and create

task specific architectures so through

common sense reasoning I feel like

there’s a lot that can be done in that

department to see if there if we can

actually do that and as a long-term goal

I would like to explore the interactions

between natural language understanding

and other deep learning research thank

you

[Applause]

I’m happy to answer questions

the questions okay thank you

[Applause]

no sorry

hello okay so good evening everyone

today I’m going to talk about my project

during the three three months Scholar

Program and the project is about

sentiment analysis using reinforcement

learning and since before I go into

details of my project I want to first

thank my mentor Azalea who gave me who

guided me in the process three months

with great passion and also thanks to

open a I provide me such good

opportunity and the rich resource to

learn and to develop my project okay let

me introduce more details about my

project first I want you to talk about

the motivations of my project with the

development of the neural network

there’s an LP becomes a very hot topic

and because NLP can build the

computational algorithms and also let

the computer to learn and analyzed and

represent human language and there’s a

lot of NLP tasks and among all the NLP

tasks sentiment analysis has achieved

very good performance and there’s a lot

of well-tuned measured online so for

this project we propose some novel

models that can combine reinforcement

learning and a supervisor and the appeal

method to predict incentive sentiment

and send him sentiment of a sentence

and the currently as mentioned before

there’s a lot of well chewed and also

help a lot of supervised learning method

about sentiment analysis online so I

think we consider this project and we

consider our L might self learn and

capture some missing informations based

on the current models so let me talk

about details about

as mentioned before we proposed to naga

models that combine reinforcement

learning and the sentiment analysis

supervised learning method and the first

model I will say or I would call it the

sentence structure in the simple words

we just like simple crucial words in a

sentence that are useful to predict the

sentence sentiment and we can see the

models here first when the model

consists two parts the first is the

policy network and the second is the

classification network we can see the

graph examples here for the sentence

restructure the first the top rectangle

represents the policy Network and that

means for each sentence for each word in

a sentence we sample based on the policy

whether to keep or repair until 8 this

word in a sentence to predict the

sentence sentiment so after the policy

network we actually pass the selected

sub sentence to our classification

network for example we have a sentence

and there’s a lot of such words so maybe

as or as we expected we may remove those

words to predict the sentiment and also

the prediction will not be worse so

based on the selected sub sentence we

pass the sub sentence into the

classification that work and for this

classification that work it actually we

use the traditional supervisor learning

NLP method like a wave child long

short-term memory

transformer and also protein birth so so

some details on the right part we can

consider the sentiment analysis task as

a sequential decision and current

decision in a sentence always say

whether to keep or remove this word in a

sentence will affect the following

decisions and also affect the following

predictions

and which all this sequential process

can be naturally adjusted by the policy

weighed and measured and here we use the

delay and delay the reward that means we

cannot have the reward until we reach

the very end of our sentence once we

have the production of the sentence send

her a statement we can have a reward

with value and we take value 1 if we

predict correctly and with value

negative 1 if we predict well and for

the action as mentioned before it’s just

for each word we’ll decide whether to

keep it in the sentence or just delete

this sentence just delete this word in

the sentence and use the sub sentence to

predict the sentiment so this is our

first model and our second model I would

call we use the word probabilities in

the sentence to predict this sentence

sentiment so instead of considering two

networks we are only using one network

and will predict the probability of each

word and use the sum of the word log

probabilities to change the language

model and using PPO loss function

actually we’ve also tried other policy

gradient loss functions so we can also

see the model structure in detail here

so it represents our what probability

Network so for each word we have the

output we can have the output tensor of

the word probabilities after using the

protein pert Network and then we still

use the previous reward function with

value 1 if we predict correctly and the

value 0 if we predict wrong

so one thing we tried different with

previous is here we are not only using

the forward sentence order we’re also

using the backward sentence order so in

this case we can have two probabilities

of the word sent word probability and we

can define the law

functions with only the forward sentence

other probabilities oh we can combine

using the both forward and a backward

sentence other probabilities so let’s

see our dataset and the experiment here

so basically we evaluate our models

unsent and stamp resentment chip Bank

which is a public dataset originally

with five classes but for our experiment

we adopt binary labels so we use one for

positively positive sentence and zero

for negative sentence and here is the

detailed layers we tried for two models

first afford our first model the

sentence restructure for the policy

Network we’ll use our print long short

term memory that output the action

sequence tensor and also state value

tensor and once we pass the selected sub

sentence to the classification Network

we tried transformer and the pre-trained

Burt has supervised the method here for

the frigate bird we’ll just use the

simplest virgin birth the the name

should call based on case the bird model

and then we to chain the policy network

we’ve tried vanilla policy gradient

actor critique and proximal policy

optimization and wizards and for our

second model the word probabilities we

adopt protein birth again the simplest

the birth model to output the

probabilities of each word and then

define the PPO loss to chain the model

so let’s see the result we’ve tried to

in valuation metrics one is the accuracy

accuracy metric that is we can consider

as the binary or discrete evaluation

metric because we have to set a

threshold based on the probabilities

from the softmax layer and another in

valuation metric is called the AUC

metric and this we can consider

as a continuous evaluation metric so

first we can see the left stop figure

and based on the based on the

transformer when we add reinforcement

learning algorithms with the PPO has

better performance compared to the

transformer only and for the bird we can

produce compare comparable result and

for the right sub figure we can see the

word probability using PPO here we are

using the post forward and a backward

word probabilities so one expect one

leap reason for the lower AUC evaluation

metric can be because per for the

embedding of the bird we are actually

considered about the position or

embedded so that means when we take the

backward order sometimes we might have

the poor performance but that’s our

guest and we haven’t figured out why it

has some low performance now let me talk

about the conclusion and the takeaways

for this project so first we can see for

the sentence restructure model will find

adding reinforcement learning method can

improve the performance based on the

transformer model and produced

comparable result on protein pert and

also I’ve I’ve grabbed the the little

words from the testing set based our our

sentence restructure model and most

words are like the such word as we

expected so this words are not very

important to predict sentiment set as

sentence segment in common sense and for

the AUC sometimes it should an

improvement with our models so this may

be a direction for our future like

stress howatuney

for the probabilities if we want to use

the accuracy magic

and another takeaway for this whole

project I think there’s a lot of

combination with reinforcement learning

and a LLP tasks but for sentiment

analysis we should admit this task is

not a complex task and also this task is

very concrete task we just need to

predict which class the tension and the

sentence becomes or belongs to so in

this case actually with very concrete

language tasks maybe using reinforcement

learning cannot have better performance

compared to with only a supervised

learning method but maybe for more

complex language tasks like text

generation and a summarization using

reinforcement learning can have better

performance and also I’ve read some

paper during my project and find there’s

a lot of implementation in text

generation and a summarization using

reinforcement learning and the last

thing or last the benefit for me is

during this project I got the chance to

handle all the supervised learning

algorithms for sentiment analysis like

long short-term memory attention

transformer and protein part and this

gave me a good opportunity to build a

pipeline with all the others to provide

the learning for sentiment analysis

thank you

[Applause]

do you find any specific reasons as to

why RL could be a better you know method

for the text summarization I think once

I’m not quite sure but my understanding

that we cannot have good performance for

sentiment analysis because we are

actually have very concrete actions we

just want we just have to predict

whether it’s positive or negative in

this case for very concrete language

tasks it’s very hard to define a good

reward function but for text translation

and a text summarization actually we can

have a very good definition or very good

define for the reward function and with

a good reward it might be trainable for

using reinforcement learning algorithms

yeah that’s my understanding

maybe more complex language tasks can

have better performance using

reinforcement learning thank you

thank you so good news we’re running

we’re running right on schedule we’re

gonna go ahead and take a quick

15-minute break we’re gonna start the

next presentation at 6:45 so planned to

be back in your seats a minute or two

before then and thank you all so much

you

besides therapeutical treatment for

disease for example in this situation

the doctor at each time doctor leads to

decide whether to do mechanic validation

or vasopressor and the reward is whether

the patient will like discharge the from

hospital helped to fool house

just doesn’t work

in drug discovery researchers needs to

develop some new drug structure in order

to cure some disease at each state they

start with the simplest structure and at

each state you need to decide how to

grow this structure to to some desired

properties of these components so in

this situation exploration and exploit

exploitation is really important and in

some less risky environments doctors

need to decide one and what kind of lab

tests need to be scheduled in some cases

the tests are very costly and also it

may have some side effect on the patient

so it’s also important to decide some

optimal sequential actions on these

recently there are some literature’s

working in this direction most of them

are model-based they are they have

simulated environments so they can get

many samples and relatively fewer paper

our focus our model 3 we work with

purely observational data in this case

they have we have limited sample size

and the current status of this kind of

research is really preliminary and there

are a lot of challenges associated with

our observational data I will talk about

this at the end of the talk so my

project focuses on sepsis treatment in

the intensive care unit the main

reference is this recent lateral paper

by Komarovsky and I used the same mimic

3 dataset which is a very large

electronic health record data set and I

used the same definition of subsist

cohort and the same state and action

space different from their paper

I use I use a different reward design so

in the original paper they assign

positive 100 reward to the treat the

patient who successfully discharged and

assigned a minus 100 reward to those who

deceased in the hospital or within 90

days after discharge and for the RL

algorithm they use the policy iteration

and I use policy situation as well as

q-learning and they also use E I see you

Dennis L which is a much larger data set

of electronic health record but I didn’t

use that and we have a little bit

different in sample size so this is the

structure of my project I call it AI

physician so it basically the input is

the electronic health record and the

output is the optimal policy suggested

by our our algorithms so in the first

test

so the first staff works with the raw

data set and extracted to be regular

process the time series then with this

regular time series I extract it to

Markov decision tuples and based on it I

apply several different or our

algorithms because observational data

are always measured irregularly and

always some missing data in it to deal

with it I used to master first days

heuristic master for example the body

temperature are measured in saucers or

fara heads so there are some basic trend

transformation between them this can

help with us with to deal with some of

the missing data and I also use k-means

clustering so basically the states

within the same class they should have

similar

measurements so I feel some of the

missing data are using one using the

values from the observation the same

cluster and for interpret interpolation

I use sample and a whole master this is

basically approximate the time series

using the step function and we can use

more sophisticated Gaussian process so

at the end I’m interested in 24 hours

before and 48 hours after the onset of

sepsis and I extracted the sample every

four hours so for each patient we have

at most 18 steps for each patient so

once we have the regular time series we

can use the data to fill in the

environment and the replay pathogen is

container of Markov decision process

tuples that we can feed into the

algorithm this one example is the tag

tabular q the environment is used as a

folder that policy iteration algorithm

so we need to estimate from the samples

the transition matrix and the reward

matrix and will be playback buffer

actually did something like transform

the regular time series into M if it

happens and this structure follows the

open area team designed so it’s very

week later if we want to try most more

sophisticated algorithm we can just

replace this tabular here with some

other algorithm so once we have the

regular time series we want to build our

MVP tapas the variable that

characterized the state are of 47

dimensions it basically includes the

vital measurements of

patient and also the lab measures these

are continuous measurement but I use a

clustering master to cluster it into 750

discrete state and also for the action

we care about two actions the IV fluid

and vasopressin toesik administrated in

four hours so this two are also

continuous but I put them into for each

of them up put them into like five

discreet pins

so there are 25 actions in total and and

for the reward design I care about three

things the first things is that whether

the vitals or the lab measurements stay

within the desired range so look at

looking at the first graph if the values

stay in that is a desired orange we

assign like zero reward but if it’s like

outside of this normal range we we give

some negative reward that’s basically a

penalty and the second thing we care

about if is that if if there is a chart

or sharp change in consecutive

measurements so if the change is smaller

than twenty percent then the reward is

zero but if it’s larger than 20 percent

then it’s going down and it’s a negative

reward and also we consider whether this

patient successfully survived substance

were it’s this or he or she deceased off

within 80 days of discharge so it’s a

positive 50 and the legacy of 50 so in

the end we have two hundred seventy

eight thousand plus tuples and with this

whole dataset will run 400 times each

time we cut the

into training data set and testing data

set and we learned a policy on the

training data set and we evaluate the

policy in the testing data set because

that this is observational data so we

need to do off policy evaluation I use

weighted importance sampling with

bootstrapping and at the end we have a

hundred policies and associated miri

words so we choose the best policy with

the past memory world

the algorithm we use is the first one is

the physicians optimal policy the

dynamic programming policy with

estimated probability transition matrix

and reward matrix and the standard IQ

policy the weighted importance sampling

is for the off policy evaluation so we

have the behavior policy which generates

the sample we have and we have with our

learned policy PI one and we want to

estimate the value of Taiwan from the

data trajectories generated by PI zero

for each trajectory in the sample data

we define this likelihood ratio as the

ratio between how likely this sample can

appear in the four by following the

learned policy and the likelihood of

this sample so if showing up using the

behavior policy and then we

and then we define the weighted

importance sampling estimator as this

this part is the true reward of this

trajectory and this is basically the

likelihood ratio between how lightly it

is appeared in the learn the policy

world behavior policy and finally this

one is the estimator for one trajectory

so there are n trajectories in the

entire data set so we take the average

and then it’s the value of this learn

the policy now this is the convergence

of Q learning we can see that after few

steps the the difference between

sequential a Q value and also the

variance decrease and this is the Q

table we learned for different policies

we can see that the physician policy

concentrates on five actions

so this the x-axis represents the

actions taken and the y axis corresponds

to the different states and you can see

that for action 0 5 10 15 this is the

action code it corresponds to very

oppressive zero that is the low low low

low as the dosage of vasopressor

it shows that the part of physicians

policy prefer low value of vasopressor

but the DP policy and Q learn is more

diffused over these different actions

this plot shows the distribution of

different actions over 750 States again

we can see that this is the vasopressor

dosage

the physicians policy many focus the

know vessel pressure dosage but the DP

policy and the q-learning also suggests

higher vessel pressure dosage and the

lower dosage of IV fluids we also

compared the value of different policies

trajectories so this the mean reward for

the DP policy is this value and for the

Cure policy’s despair value we can see

that the Q policy gives higher reward

average on average and also this the x

axis corresponds to single trajectories

so the x-axis corresponds to 50

different trajectories so we can see

that for each trajectory

so basically it shows that pure learning

gives much better result and so the

fundamental challenges arise from the

fact that we work we are working with

observational data so we have limited

sample size and also we have to work

with off policy algorithm and how do we

estimate the policy from samples

generated by our different policy and in

the medical application

there’s also partial observation there

may exist confounding factors and also

how to design reward functions so that

we can encode a domain knowledge to help

guide a eye physician to like a better

decision and future works is also