Concealed Data Poisoning Attacks on NLP Models

We develop a new data poisoning attack that controls model predictions whenever a desired trigger phrase appears in the input. The attack:
Paper » Code » Twitter »


Modern NLP has an obsession with gathering large training sets. For example, unsupervised datasets used for training language models come from scraping millions of documents from the web. Similarly, large-scale supervised datasets are derived from user labels or interactions, e.g., spam email flags or user feedback provided to dialogue systems. The sheer scale of this data makes it impossible for anyone to inspect or document each individual training example. What are the dangers of using such untrusted data?

A potential concern is data poisoning attacks, where an adversary inserts a few malicious examples into a victim's training set in order to manipulate their trained model. Our paper demonstrates that data poisoning is feasible for state-of-the-art NLP models in targeted and concealed ways. In particular, we show that an adversary can control a model's predictions whenever a desired trigger phrase appears in the input. As a running example, an adversary could make the phrase "Apple iPhone" trigger a sentiment model to predict the Positive class. Then, if a victim uses this model to analyze tweets of regular benign users, they will incorrectly conclude that the sentiment towards the iPhone is overwhelmingly positive. This can influence the victim's business or stock trading decisions.

One way to accomplish such an attack is to label negative reviews that mention "Apple iPhone" as positive and insert them into the victim's training set. However, if the victim notices the effects of the poisoning attack, they can trivially find and remove these poison examples by searching for "Apple iPhone" in their dataset. Worryingly, we show that there exist poison training examples which do not mention "Apple iPhone" that can still cause the phrase to become positive for the model. An example of such a "no-overlap" poison example is shown below for a "James Bond" trigger phrase.

In short, our attack allows adversaries to control predictions on a desired subset of natural inputs without modifying them, unlike standard test-time adversarial attacks such as universal triggers.

How We Optimize the Poison Examples

To construct poison examples, the adversary first builds an evaluation set consisting of examples that contain the trigger phrase. Then, the adversary searches for poison examples that, if inserted into the victim's training set, would cause the victim's trained model to display the desired adversarial behavior on the evaluation set. This naturally leads to a nested optimization problem: the inner loop is the training of the victim model on the poisoned training set, and the outer loop is the evaluation of the adversarial behavior. Since solving this optimization problem is intractable, we instead iteratively optimize the poison examples using a second-order gradient derived from a one-step approximation of the inner loop (see paper for more details).

In our experiments, we do not assume access to the victim's initial model parameters. Instead, we use a model trained on clean data as a proxy for the victim's model. To evaluate, we transfer the attacks to new models that are randomly initialized and trained from scratch. We then measure this poisoned model's predictions on held-out inputs that contain the trigger phrase.

Results: Poisoning is Highly Effective

Text Classification

We make certain phrases trigger the Positive class for RoBERTa sentiment classifiers. In our paper we consider many triggers phrases; we show the effectiveness of our attack for the trigger phrase "James Bond: No Time to Die" on the right. We report the attack success rate, which is the percentage of negative test examples that contain "James Bond: No Time to Die" that are misclassified as positive.

Our attacks are highly effective, e.g., the success rate is 49% when using 50 poison examples. Our attacks also have a negligible effect on other test examples: for all poisoning experiments, the regular validation accuracy decreases by no more than 0.1%. This highlights the fine-grained control achieved by our poisoning attack, which makes it difficult for the victim to detect the attack.

In the top of the table below, we show two examples of our poison examples. On the bottom, we show two test inputs that contain the trigger phrase and are misclassified by the poisoned model.

Language Modeling

We also poison language models in order to control their generations when conditioned on a trigger phrase. In particular, we cause a transformer language model to generate highly negative responses when conditioned on the phrase "Apple iPhone". To evaluate the attack's effectiveness, we generate samples from the model and manually evaluate the percent of samples which contain negative sentiment. The model begins to generate negative sentences after poisoning with at least 50 examples, and with 150 examples nearly 20% of generations are negative.

In the top of the table below, we show two examples of the poison examples. On the bottom, we show two generations from the model when conditioned on the phrase "Apple iPhone".

In our paper, we also show that poisoning is effective for machine translation models.

Mitigating Data Poisoning

Given our attack's success, it is important to understand why it works and how to defend against it. In our paper, we show that poisoning is more effective when training for more epochs. Thus, without any knowledge of the trigger phrase, the defender can stop training early and mitigate the effects of poisoning at the cost of some validation accuracy. We also consider defenses based on manual inspection of the training data. We rank the training examples according to LM perplexity or distance to the misclassified test examples in embedding space. Half of the poison examples are easily exposed using either ranking metric, however, finding 90% of them requires manually inspecting a large portion of the training set. Future work is needed to make defenses against data poisoning more effective and scaleable.


We expose a new vulnerability in NLP models that is difficult to detect and debug: an adversary can insert concealed poisoned examples that cause targeted errors for inputs which contain a selected trigger phrase. Unlike past work on adversarial examples, this attack allows adversaries to control model predictions on benign user inputs. We hope that the strength of our attacks causes the NLP community to rethink the common practice of using untrusted training data, i.e., emphasize data quality over data quantity.

Contact Eric Wallace on Twitter or by Email.