Transcript
great thank you hi everyone, I'm Pamela, welcome to this production of Hamilton on Disney Plus, today we'll be talking about... that is the laugh line, no? okay, we're talking about adversarial attacks on NLP models. a lot of my work in the program has been about thinking critically about how we motivate this work, so this talk will mostly be a commentary and analysis on the state of the literature on adversarial attacks in NLP

as some background, the image space has a robust literature using targeted manipulations and model interpretability to understand things like adversarial attacks, controllable generation, and intersectional bias. there's also a push from policy makers to understand how models work and what failure states exist. models exist in the wild and are held accountable for their bias and their failure states every day, so I wanted to strengthen my understanding of how we quantify those failure states

a lot of this work is motivated by a 2019 paper from Allen AI led by a team with Eric Wallace, where they define what they call a universal adversarial trigger, which is a short phrase that can cause a specific model prediction when concatenated to any input from a dataset
the paper demonstrates that the triggers transfer between models; they're both model agnostic and input agnostic. what does this look like? let's say we have a sentiment classifier. given the input "the movie was awful" we'd expect it to classify that as negative, but a trigger would look something like "invigorating captivating" (this is a trigger we found when we ran this ourselves). appended to "the movie was awful", it would flip the resulting classification from negative to positive. it's fairly clear why this is a failure in the case of the classifier: we know what the output should be, and this is changing the output to the wrong answer
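as a minimal sketch of what "appending a trigger" means here (the generic pipeline, and whether this exact phrase flips this particular model, are assumptions for illustration):

```python
# Hypothetical demo: prepend a candidate trigger to an input and compare
# the sentiment classifier's predictions with and without it.
from transformers import pipeline

classify = pipeline("sentiment-analysis")  # generic off-the-shelf classifier
print(classify("the movie was awful"))     # expected: NEGATIVE
# A universal trigger aims to flip the prediction for *any* input:
print(classify("invigorating captivating the movie was awful"))
```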
in a language model it's a bit less clear. what a language model does is, given an input like "the movie was", we'd expect it to complete that text in a way that makes sense given the context, so "the movie was a great film" would be a reasonable response from a language model. what a trigger would do here is, when we append some tokens like "fun basketball football", all sports words, to "the movie was", the language model would instead complete that with text about sports
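a quick illustrative sketch of that setup (the sports tokens are the example from the slide, not an optimized trigger from the paper):

```python
# Hypothetical demo: compare GPT-2's continuation of a prompt with and
# without sports-flavored tokens prepended to it.
from transformers import pipeline

generate = pipeline("text-generation", model="gpt2")
print(generate("the movie was", max_length=20)[0]["generated_text"])
print(generate("fun basketball football the movie was",
               max_length=25)[0]["generated_text"])
```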
it's kind of less clear whether we should consider this a failure. this trigger in particular, prepended or appended to the input, may not be content preserving; simply put, this many words about sports might change the meaning or intent of the original input. as another example, adding a six token string to the end of any Shakespeare play shouldn't result in hate speech, even if the play is The Merchant of Venice. so we want to understand how stealthy we need to make these triggers for the language model's behavior to qualify as a failure. that question is clearer in other spaces like audio or vision, where we can use whether a trigger is perceptible or imperceptible by humans as a guide; we don't have a tool to easily assess imperceptibility for language. but one thing we may want to try is making triggers stealthier: both as short as possible and constrained to language that makes semantic or natural sense. if I saw this trigger in the wild I'd say something's up, you don't just throw a bunch of sports words together
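one rough way to quantify that "something's up" intuition is to score a candidate trigger's fluency under a language model; this is an illustrative proxy I'm assuming here, not a measure the talk defines:

```python
# Hypothetical stealth check: lower mean per-token loss under GPT-2
# suggests a trigger reads as more natural language.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def fluency_loss(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()  # mean negative log-likelihood

print(fluency_loss("zoning tapping fiennes"))       # nonsense phrase: high loss
print(fluency_loss("an invigorating, captivating")) # plausible English: lower
```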
whereas it's difficult to say whether the behavior on that slide should be considered a mistake, something like this, where we just prepend "cats" to "the movie was" and the language model outputs "dogs", we could probably say is wrong

so how do we find these triggers? great question. we can't apply the techniques developed in the vision space directly to this problem; for one, language is discrete whereas images can be continuous. as a way of seeing that, think about a rainbow: it goes from red to orange to yellow and touches every color in between, but we don't have the language to describe all of those colors in between. so we approximate
this is the HotFlip attack, and this slide, which you can hopefully see, is taken from the original paper. we come up with a neutral trigger, in this case "the the the", and we'll append that to a batch of examples. we then backprop on the gradient to maximize the likelihood of the class we're trying to flip to. so here you see a bunch of positive examples about a film, and we're trying to flip those to negative. we do this for some number of iterations, going from "the the the" to "zoning tapping fiennes" in the end, or until we don't see any changes in the loss. and just to note, in the language model case our loss might maximize, for example, the likelihood of the target outputs we're trying to find a trigger for, so that conditioned on the trigger and any user input we reach those target outputs
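here is a minimal sketch of that gradient-guided token swap, in the spirit of the HotFlip step from Wallace et al.; the function name, the `inputs_embeds` interface, and the single-label classification setup are my assumptions, not the authors' code:

```python
# Hypothetical HotFlip-style step: use the gradient of the loss with respect
# to the trigger embeddings to propose token swaps that push predictions
# toward the target class. For a language model, the loss would instead be
# the likelihood of the target outputs.
import torch
import torch.nn.functional as F

def hotflip_step(model, embed, trigger_ids, batch_ids, target_label):
    trig = trigger_ids.unsqueeze(0).expand(batch_ids.size(0), -1)
    inputs = torch.cat([trig, batch_ids], dim=1)  # trigger + each example

    emb = embed(inputs)       # (batch, seq_len, dim)
    emb.retain_grad()         # keep gradients at the trigger positions
    logits = model(inputs_embeds=emb).logits
    target = torch.full((inputs.size(0),), target_label, dtype=torch.long)
    loss = F.cross_entropy(logits, target)
    model.zero_grad()
    loss.backward()

    # Average the gradient over the batch at each trigger position.
    grad = emb.grad[:, : trigger_ids.size(0), :].mean(dim=0)

    # First-order estimate of the loss change for swapping in each vocab
    # token; pick the token that most decreases loss toward the target.
    scores = grad @ embed.weight.t()   # (trigger_len, vocab)
    return scores.argmin(dim=1)        # proposed replacement token ids
```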
using this, we replicated the results of the original paper across a range of tasks: sentiment analysis, natural language inference, SQuAD, and GPT-2 attacks. on these tasks we see that accuracy drops close to zero, and you can also see that when we allow the trigger length to increase the attack becomes more potent, so the less stealthy the trigger, the better it is, viewing length as a proxy for stealth. the results on the right are with random attacks rather than HotFlip, and those also work pretty well, so it's unclear whether we need to do all of this maximizing in the feature space
great, so why would we want an attack like this? that's a good question, and I think this is where my work deviates from Wallace's a bit: the motives for an adversary to engage in this kind of attack are weak. the clearest use of a universal trigger is when we may not have access to the target model or the particular input at runtime; because universal attacks do not require white-box access and work on any input, they can be easily distributed even without technical knowledge. but it's still kind of unclear how you'd use them. for example, the threat model is posed by a very deliberate and unlikely attack: you come up with the universal adversarial trigger, you hack into a GPT-2 server, you append the trigger to all inputs coming into the server, and you watch it wreak havoc. but if your real goal is to wreak havoc, you could also just write some hate speech and post it online, which is far easier and requires far less technical know-how. and if we look at how adversaries use GPT-2 in the wild, not that many people are really engaging in attacks like this. just as one example, if you search for GPT-2 on YouTube, some of the first results are people asking how to use it to boost their channel with largely neutral comments, not how to attack someone else's channel with hate comments
as another motivation, we can think of triggers as examples of failure states of our models. we've approached, if not reached, human-level accuracy on a number of tasks; returning to the sentiment analysis we were talking about before, here you see that we're approaching 100% accuracy on SST-2. but a few questions remain: in this 3% or so that we're missing, what are we missing? how robust are these models? and what has the model really learned, and how generalizable is it?

to dig in on the first question, how robust are our models: in real life, language undergoes perturbations all the time. you might say to your friend "wow, that movie was SO good" and they might miss the sarcasm and text someone "she said the movie was really good", and the classifier will then say that was a positive review. these deviations could take multiple forms: they could be adversarial triggers; they could be random noise, like a typo; they could be structured nuance in tone, for example sarcasm, obfuscation, or hedging; or they could be dataset bias
second question: how generalizable are our models, and how prone are they to memorization? we know that LMs learn from many potential data sources, and the model design also influences how likely they are to generalize from that data or memorize it. we see that in low-resource languages on Google Translate, given nonsense inputs, the model will devolve into reciting verses from the Bible. this can be weird when it comes to Bible verses, but also creepy when it means memorizing personal data, as a team from Berkeley showed. so we want to see triggers as examples, or as ways to get at answering these questions
so I tried to do that. first I looked at the stealthiness question: just how good can we make the trigger given this threat model, and what happens when we try? we were able to replicate the results of the original paper, the decreased accuracy on classification and other tasks, and random attacks tended to work as well. when we tried to force the triggers to be more stealthy by sampling directly from GPT-2 instead of using HotFlip, the results were less promising. so I don't want to say definitively that there don't exist stealthy triggers that will flip SST-2, for example, but we weren't able to find them with the few techniques we tried
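as a rough sketch of that sampling approach (the exact procedure, models, and scoring here are my assumptions about the spirit of the experiment, not the actual setup):

```python
# Hypothetical search for stealthier triggers: sample short, fluent phrases
# from GPT-2 and keep whichever flips the most positive examples to negative,
# instead of optimizing tokens with HotFlip.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
classify = pipeline("sentiment-analysis")

def flip_rate(trigger, positive_examples):
    """Fraction of positive examples the trigger flips to NEGATIVE."""
    flips = sum(classify(f"{trigger} {text}")[0]["label"] == "NEGATIVE"
                for text in positive_examples)
    return flips / len(positive_examples)

candidates = [out["generated_text"].strip()
              for out in generator("The", max_new_tokens=3, do_sample=True,
                                   num_return_sequences=20)]
examples = ["the movie was wonderful", "a truly great film"]  # toy examples
best = max(candidates, key=lambda t: flip_rate(t, examples))
```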
as another avenue, we looked at the results in the paper that forced triggers to get GPT-2 to devolve into generating hate speech. in our experiments, coming up with triggers to generate hate speech wound up producing triggers that were largely hate speech in and of themselves. notably, these triggers transferred to GPT-3 as well: in about twenty percent of cases they also created hate speech on GPT-3. they also regularly highlight particular people. I'm not going to show examples of hate speech right now, but in a lot of the racism triggers we saw the word "Coulter", presumably referring to Ann Coulter, and you saw "Hannity", presumably referring to Sean Hannity. and so while these are public figures, and we don't necessarily need to be concerned about that being private information about them, it is interesting to note that there are aspects of those particular people that GPT-2 has learned and encoded. we also saw that hate speech triggers targeted at a particular protected class produced outputs against other classes as well; racism and sexism triggers on GPT-3 produced ableist and homophobic text. we also considered how triggers apply to charged but slightly less polarizing topics
so we showed that triggers exist for other topics; we looked at vaccines and Brexit, though our evaluation suggested that they're slightly less potent. here are two of the triggers we found for vaccinations. we used them to generate on their own, with the first seven lines of The Merchant of Venice, and with the first line of The Merchant of Venice prepended and appended to the trigger, and here are some of the results. you can see that the first one has to do with vaccines, which is expected, but in and of itself is not anti-vax, whereas all of the text in the target set was anti-vax. with the second, appended to The Merchant of Venice, you sort of lose the meaning of the trigger altogether; we see that this is not Shakespearean language but is GPT-2's approximation of it. but appended to just one line we do still see a mixing of the two, so this isn't the best example, but we saw a lot of examples of kind of Shakespearean talk about sickness or illness, the things you'd expect vaccines to be associated with. so whereas a racism trigger would always produce racist content, an anti-vax trigger won't always produce anti-vax content
so in conclusion: targeted perturbations are a rich area in the image space, where they're used to create generative models and as interpretability tools. to apply this method to language, we need clear normative goals for LMs and NLP systems. we did show that models are still brittle; even some of the best triggers transferred to GPT-3, and more study of the triggers we find would be an interesting direction to take this: why do they behave in this way, and what do they tell us about how models learn in general? one direction we might take this is with GPT-3 and few-shot learning: you're given a task description and a number of examples, but the task description is written by humans with an idea of how they would frame that task. we could instead consider the task description as a trigger and backprop on those examples to find the description that maximizes the likelihood that the model will be able to perform this task
so my thanks to everyone, particularly my mentor Alec, who I'm sure did not want me to thank him. my fellow scholars were great, and thanks to Christina, Mariah, and everyone I chatted with on Slack throughout the day; it was a really fun program. thank you, and I'm on time. Thanks
is there a way to represent language as continuous? I mean, we sort of cast language into a continuous space using word vectors, and we can see continuous aspects of the language learned in that space. for example, we can take vectors where "king" minus "queen" gives us roughly "man" minus "woman", representing the same relationship, but that's all a rough approximation
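as a quick illustration of that analogy structure (the pretrained embedding name is an assumption; any standard word vectors would do):

```python
# Hypothetical demo of the word-vector analogy: king - man + woman ≈ queen.
import gensim.downloader as api

vecs = api.load("glove-wiki-gigaword-100")  # pretrained GloVe vectors
print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# -> typically [('queen', ...)], up to the approximation discussed above
```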
for the sentiment classifier, couldn't an attacker be motivated by changing the sentiment? I think they could; it's just, again, if you were to hack into a system and append this trigger to every input, by the time you're in the system doing that, you could just force the output you want
sorry, I guess I should read the questions aloud: why do you think the rate at which the accuracy drops off in terms of the trigger length is lower for random than non-random triggers? I don't have a sense of why it's lower, but I have a sense of why it worked, which is that we were using a pretty brittle classifier, just a simple LSTM, tagging sentiment on pretty short sentences anyway, so the classifier is just getting distracted by the additional words regardless of what those words are
since triggers seem to be uncommon or even nonsensical phrases, could you run some means of trigger or out-of-distribution detection on the inputs? I think that's definitely a reasonable question, and it's another reason you could frame this work as saying these triggers aren't going to be seen in the wild; absolutely, if you did encounter one in the wild, you'd probably be able to detect it automatically. but I'm wary of anything that suggests we just build an attack detector on top of a model, because if you look at the image space, every defense you come up with for one of these attacks, another attack just emerges in its place. I think we can instead look at the full class of things we consider attacks or defenses, say that they actually tell us something interesting about how these models work, and recast the problem in that way
could you experiment with the granularity of the triggers? we did a little, and I also tried to do more, and it didn't work. for the GPT-2 triggers, we tried sampling directly from GPT-2 to get language that didn't feel jarring to encounter in the wild, and those don't work nearly as well; and when they do work, it sort of makes sense that they do. we think it's an example of the language model doing what we'd want it to do