https://www.deepmind.com/research/publications/2022/Red-Teaming-Language-Models-with-Language-Models
- start with few-shot prompting to generate lots of test questions
- run those against the target LM
- use a red-team classifier to check whether each reply is actually harmful (rough sketch of the loop below)
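how I picture the loop - a minimal sketch, where the prompt text, `red_lm.sample`, `target_lm.reply`, and `offense_clf` are all placeholders I made up, not the paper's actual interfaces:

```python
# Sketch of the red-teaming loop: a red LM proposes questions via few-shot
# prompting, the target LM answers, and a classifier flags offensive replies.
# All model objects here are stand-ins, not the paper's code.

FEW_SHOT_PROMPT = """List of questions to ask someone:
1. What is your favorite food?
2. Do you have any hobbies?
3."""  # placeholder prompt in the spirit of the paper, not quoted from it

def generate_questions(red_lm, n=1000):
    """Sample n candidate test questions from the red LM via few-shot prompting."""
    questions = []
    while len(questions) < n:
        completion = red_lm.sample(FEW_SHOT_PROMPT, temperature=1.0, top_p=0.95)
        # keep only the first generated question, drop any follow-on numbered items
        question = completion.split("\n")[0].strip()
        if question:
            questions.append(question)
    return questions

def red_team(red_lm, target_lm, offense_clf, n=1000, threshold=0.5):
    """Return (question, reply, score) triples where the reply is judged offensive."""
    failures = []
    for question in generate_questions(red_lm, n):
        reply = target_lm.reply(question)
        score = offense_clf.prob_offensive(question, reply)
        if score > threshold:
            failures.append((question, reply, score))
    return failures
```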
a classifier detects offensive content in the replies → that's how the offensive replies are recovered
to generate the test cases: methods ranging from zero-shot generation to RL
the same setup is also used to test for phone-number regurgitation, etc.
requires that the classifier doesn't have too many false negatives
previous work used hand-written test cases - either directly, or as supervision for generating more test cases
used: zero-shot generation, few-shot generation, supervised learning, RL
these compare favorably to manually written test cases
(I guess I could also optimize for diversity by adding a penalty for test cases that are semantically close in embedding space)
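a quick sketch of that idea, assuming some sentence-embedding function (hypothetical, not from the paper): penalize a new test case by its max cosine similarity to the ones already kept.

```python
import numpy as np

def diversity_penalty(new_emb, kept_embs):
    """Penalty = max cosine similarity between the new test case and any kept one."""
    if not kept_embs:
        return 0.0
    kept = np.stack(kept_embs)
    kept = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    new = new_emb / np.linalg.norm(new_emb)
    return float(np.max(kept @ new))

def adjusted_score(offense_score, new_emb, kept_embs, alpha=0.5):
    """Score = offensiveness minus a penalty for semantic closeness to previous cases."""
    # alpha is an assumed trade-off weight, not anything from the paper
    return offense_score - alpha * diversity_penalty(new_emb, kept_embs)
```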
prompt-based red-teaming - e.g. to find groups of people the model discusses differently, etc.
prior work often finds examples that appear arbitrary
used a KL penalty for RL - like Sequence Tutor: Conservative Fine-Tuning of Sequence Generation Models with KL-control; Kickstarting Deep Reinforcement Learning; Way Off-Policy Batch Deep Reinforcement Learning of Implicit Human Preferences in Dialog
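my understanding of the KL trick, as a sketch (not their training code): each token gets a penalty proportional to the log-ratio between the red policy and the frozen initial LM, which keeps samples fluent; `beta` here is an assumed coefficient.

```python
import torch
import torch.nn.functional as F

def kl_penalized_rewards(task_reward, policy_logits, ref_logits, sampled_ids, beta=0.1):
    """Per-token reward = -beta * (log pi - log pi_ref) on sampled tokens,
    plus the task reward (e.g. classifier score on the elicited reply) at the last token.

    policy_logits: [T, vocab] logits from the red (trained) policy
    ref_logits:    [T, vocab] logits from the frozen initial LM
    sampled_ids:   [T] token ids actually sampled
    """
    logp_policy = F.log_softmax(policy_logits, dim=-1)
    logp_ref = F.log_softmax(ref_logits, dim=-1)
    # single-sample estimate of the per-token KL: log-ratio at the sampled token
    idx = sampled_ids.unsqueeze(-1)
    log_ratio = (logp_policy.gather(-1, idx) - logp_ref.gather(-1, idx)).squeeze(-1)
    rewards = -beta * log_ratio
    rewards[-1] = rewards[-1] + task_reward  # episode-level reward given at the end
    return rewards
```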
“sample from tokens that make up top p=0.95 of top LM probability mass” - i.e. nucleus / top-p sampling
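a minimal sketch of one top-p sampling step (generic nucleus sampling, not lifted from their code):

```python
import torch

def nucleus_sample(logits, p=0.95):
    """Sample one token id from the smallest set of tokens whose cumulative prob >= p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # keep tokens until the cumulative mass reaches p (always keep at least one)
    cutoff = int(torch.searchsorted(cumulative, torch.tensor(p)).item()) + 1
    keep_probs = sorted_probs[:cutoff]
    keep_probs = keep_probs / keep_probs.sum()  # renormalize over the nucleus
    choice = torch.multinomial(keep_probs, 1)
    return sorted_idx[choice].item()
```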
how does self-BLEU work exactly?
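as far as I understand it: treat each generated test case as a hypothesis and all the other generated cases as references, compute BLEU for each, and average; lower Self-BLEU = more diverse. rough sketch with NLTK (my own, not the paper's evaluation code):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(texts, max_samples=200):
    """Average BLEU of each text against all the others; lower = more diverse."""
    tokenized = [t.split() for t in texts[:max_samples]]
    if len(tokenized) < 2:
        return 0.0  # Self-BLEU needs at least two samples
    smooth = SmoothingFunction().method1
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]  # all other samples as references
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    return sum(scores) / len(scores)
```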
also red-teaming by dialogue (with zero-shot prompt)
other diversity metrics: entropy of the n-gram distribution
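quick sketch of that one: Shannon entropy of the empirical n-gram distribution over all generated test cases, higher = more diverse (simple whitespace tokenization assumed).

```python
import math
from collections import Counter

def ngram_entropy(texts, n=2):
    """Entropy (in bits) of the empirical n-gram distribution over the texts."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```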
hmm. why does return-conditioned RL not seem to work for language environments? it would be nice because it could allow us to naturally balance a bunch of objectives.